schema for storing different varchar fields over time? - sql-server

This app I'm working on needs to store some metadata fields about an entity. The problem is that we can already foresee that these fields are going to change a lot in the future. Right now every entity property is translated to one column in the entity table, but altering table columns later down the road will be costly and error-prone, right?
Should I go for something like this (key-value store) instead?
MetaDataField
-----
metaDataFieldID (PK), name
FieldValue
----------
EntityID (PK, FK), metaDataFieldID (PK, FK), value [varchar(255)]
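In DDL, that key-value idea would look roughly like this (a sketch; it assumes an existing Entity table, and the column sizes are just placeholders):
-- Lookup of known metadata field names
CREATE TABLE MetaDataField
(
    metaDataFieldID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name            varchar(100)      NOT NULL UNIQUE
);
-- One row per (entity, field) pair, holding the value as text
CREATE TABLE FieldValue
(
    EntityID        int          NOT NULL REFERENCES Entity (EntityID),
    metaDataFieldID int          NOT NULL REFERENCES MetaDataField (metaDataFieldID),
    value           varchar(255) NULL,
    CONSTRAINT PK_FieldValue PRIMARY KEY (EntityID, metaDataFieldID)
);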
P.S. I also thought of using XML on SQL Server 2005+. After talking to some people, it seems that is not a viable solution because it would be too slow for certain reporting queries.

You're right, you don't want to go changing your data schema any time a new parameter comes up!
I've seen two ways of doing something like this. One, just have a "meta" text field, and format the value to define both the parameter and the value. Joomla! does this, for example, to track custom article properties. It looks like this:
ProductTable
id name meta
--------------------------------------------------------------------------
1 prod-a title:'a product title',desc:'a short description'
2 prod-b title:'second product',desc:'n/a'
3 prod-c title:'3rd product',desc:'please choose sm med or large'
Another way of handling this is to use additional tables, like this:
ProductTable
product_id name
-----------------------
1 prod-a
2 prod-b
3 prod-c
MetaParametersTable
meta_id name
--------------------
1 title
2 desc
ProductMetaMapping
product_id meta_id value
-------------------------------------
1 1 a product title
1 2 a short description
2 1 second product
2 2 n/a
3 1 3rd product
3 2 please choose sm med or large
In this case, a query will need to join the tables, but you can optimize the tables better, you can query for individual meta values without returning all parameters, etc.
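For example, pulling a single meta value without fetching everything might look like this (a sketch using the table names above):
SELECT p.product_id, p.name, pmm.value AS title
FROM ProductTable p
INNER JOIN ProductMetaMapping pmm ON pmm.product_id = p.product_id
INNER JOIN MetaParametersTable mp ON mp.meta_id = pmm.meta_id
WHERE mp.name = 'title';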
Choosing between them will depend on complexity, whether data rows ever need to have differing meta, and how the data will be consumed.

The Key-Value table is a good idea and it works much faster than the SQL Server 2005 XML indexes. I started the same type of solution with XML in a project and had to change it to an indexed Key-Value table to gain performance. I think SQL Server 2008 XML indexes are faster, but I have not tried them yet.

The XML speed only factors in depending on the size of the data going into the XML column. We had a project that stuffed data into and processed data from an XML column. It was very fast... until you hit around 64 KB. At 63 KB and less it took milliseconds to get the data out or insert it. At 64 KB the operations jumped to a full minute. Go figure.
Other than that the main issue we had was complexity. Working with xml data in sql server is not for the faint of heart.
Regardless, your best bet is to have a table of name/value pairs tied to the entity in question. Then it's easy to support entities with different properties or to dynamically add/remove properties. This too has its caveats. For example, if you have more than, say, 10 properties, then it will be much faster to do pivots in code.
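For reference, this is the kind of T-SQL pivot that grows by one expression per property, which is why pivoting in application code tends to win once there are many properties (table and column names follow the key-value sketch in the question):
-- One MAX(CASE ...) per property; with dozens of properties this gets long and slow
SELECT v.EntityID,
       MAX(CASE WHEN f.name = 'title' THEN v.value END) AS title,
       MAX(CASE WHEN f.name = 'color' THEN v.value END) AS color
FROM FieldValue v
INNER JOIN MetaDataField f ON f.metaDataFieldID = v.metaDataFieldID
GROUP BY v.EntityID;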

There is also a pattern for this to consider -- called the observation pattern.
See similar questions/answers: one, two, three.
The pattern is described in Martin Fowler's book Analysis Patterns; essentially it is an OO pattern, but it can be done in a DB schema too.

"altering table columns later down the road will be costly and error-prone right?"
A "table column", as you name it, has exactly two properties : its name and its data type. Therefore, "altering a table column" can refer only to two things : altering the name or altering the data type.
Wanting to alter the name is indeed a costly and error-prone operation, but fortunately there should never be a genuine business need for it. If a certain established column seems somewhat inappropriate, with afterthought, and "it might have been given a better name", then it is still not the case that the business incurs losses from that fact! Just stick with the old name, even if with afterthought, it was poorly chosen.
Wanting to alter the data type is indeed a costly operation, susceptible to breaking business operations that were running smoothly, but fortunately it is quite rare that a user comes round to tell you that "hey, I know I told you this attribute had to be a Date, but guess what, I was wrong, it has to be a Float.". And other changes of the same nature, but more likely to occur (e.g. from shortint to integer or so), can be avoided by being cautious when defining the database.
Other types of database changes (e.g. adding a new column) are usually not that dangerous and/or disruptive.
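For scale, the "not that dangerous" changes mentioned above are one-liners (the table and column names here are made up):
-- Adding a new nullable column is usually cheap and non-disruptive
ALTER TABLE dbo.Customer ADD MiddleName varchar(50) NULL;
-- Widening a data type (e.g. smallint to int); depending on the change this
-- may have to touch every row, so it still deserves testing and a maintenance window
ALTER TABLE dbo.Customer ALTER COLUMN VisitCount int NOT NULL;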
So don't let yourself be scared by those vague sloganesque phrases such as "changing a database is expensive and dangerous". They usually come from ignorants who know too little about database management to be involved in that particular field of our profession anyway.
Maintaining queries, constraints and constraint enforcement on an EAV database is very likely to turn out to be thousands of times more expensive than "regular" database structure changes.

Related

Single Big SQL Server lookup table

I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup tables have almost identical structures: two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the Big lookup table Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs so, if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
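A sketch of that composite-key arrangement (the table names follow the question and are not an endorsement of the single-table design; the CHECK constraint is my addition to pin the category):
CREATE TABLE dbo.AppCodes
(
    AppCodeCategory varchar(20)  NOT NULL,
    AppCode         varchar(20)  NOT NULL,
    Decode          varchar(100) NOT NULL,
    CONSTRAINT PK_AppCodes PRIMARY KEY (AppCodeCategory, AppCode)
);
CREATE TABLE dbo.Users
(
    UserID               int         NOT NULL PRIMARY KEY,
    -- BOTH columns must exist in the fact table so the composite FK can reference the composite PK
    LanguageCodeCategory varchar(20) NOT NULL
        CONSTRAINT CK_Users_LanguageCategory CHECK (LanguageCodeCategory = 'Language'),
    LanguageCode         varchar(20) NOT NULL,
    CONSTRAINT FK_User_AppCodes FOREIGN KEY (LanguageCodeCategory, LanguageCode)
        REFERENCES dbo.AppCodes (AppCodeCategory, AppCode)
);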
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statii, etc are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows from adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? It is a reasonable question, if you are going to spend a number of hours making changes to the system, that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#, if not then translate to whatever is appropriate) by setting the enum values explicitly with a pattern of the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" catagory == 2, you could do:
enum AppCodes
{
    // Countries
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.

Why can't we put all masters in one Master table in Database?

We are designing a small database in MS Access 2010 and we have like 3 master attributes
Let's take, for example, Country, State and Taste. Instead of designing a master table for each attribute, we have come up with one table like below:
ID Value Attribute
1 USA Country
2 UK Country
3 Illinois State
4 Wisconsin State
5 Sweet Taste
6 Sour Taste
We are using self joins and getting what is required.
Does anyone think that it is not a good database design? If yes, please explain.
Reasons against:
1) Extra storage space to store a field indicating what type it is (cancelled out by the primary keys on each table when having multiple tables, but then you'll need to store the type as a (small) integer type, not a string type).
2) Extra storage space for fields that are not applicable to certain types (N/A if the above is not just an example, and there won't be more fields, but then I'm questioning the rest of your DB design, and extensibility is always worth a consideration).
3) Reduced performance to select the applicable rows.
4) An index would obviously be required on Attribute (otherwise (3) is a performance killer), so - reduced performance on update and delete statements.
5) Bad database design - don't combine concepts that don't belong together
EDIT:
6) Database integrity - what stops you from just inserting invalid data into the Attribute field. Admittedly, you can have another table with attributes and make Attribute a foreign key to that table, which is a bit messy and confusing to figure out what's going on sometimes.
7) Foreign keys - doing this will just be a mess, not to mention you can't enforce database integrity, and there are likely speed implications.
8) Visualization - any table diagrams will have to be manually drawn or edited because an automatic generating tool (most likely) won't be able to account for this type of design.
If I need to get a list of states by country, how do I do that? With your design, you can't do that, other than by adding an additional table. If you split into entity types, e.g. Country, AdministrativeDivision and Taste, you can store the appropriate attributes per entity, instead of complicated join tables. The resulting SQL is easier to read and debug.
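A sketch of that split, written as standard SQL DDL (Access DDL syntax differs slightly; the names are illustrative and follow the entity types mentioned above):
CREATE TABLE Country
(
    CountryID   int          NOT NULL PRIMARY KEY,
    CountryName varchar(100) NOT NULL
);
CREATE TABLE AdministrativeDivision
(
    DivisionID   int          NOT NULL PRIMARY KEY,
    CountryID    int          NOT NULL REFERENCES Country (CountryID),
    DivisionName varchar(100) NOT NULL
);
-- "States by country" becomes a plain join instead of a self-join on a catch-all table
SELECT d.DivisionName
FROM AdministrativeDivision d
INNER JOIN Country c ON c.CountryID = d.CountryID
WHERE c.CountryName = 'USA';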
There is really no reason to attempt to "optimize" by minimizing the number of tables. Any modern database engine will not suffer a performance penalty from additional tables. Your design may in fact trigger a performance penalty. Depending on how many different entities you try to jam into that table, you may end up making it so large that the entire table can't fit into memory, thus forcing the database to page from disk when performing selects and joins on this table. A good rule of thumb: if you can't make a reasonable guess about what query plan your database might use to get the data you are requesting, just follow accepted SQL best practices.
There is one situation I can think of where this design could be acceptable, and that's if you need to provide a store for users to add their own categories and values at runtime.

Best approach to store data whose attributes can vary

Please, read first my previous question: T-SQL finding of exactly same values in referenced table
The main purpose of this question is to find out if this approach of storing of data is effective.
Maybe it would be better to get rid of the PropertyValues table and use an additional PropertyValues nvarchar(max) column in the Entities table instead. For example, instead of the
EntityId PropertyId PropertyValue
1 4 Val4
1 5 Val5
1 6 Val6
table, I could store such data in the PropertyValues column: "4:Val4;5:Val5;6:Val6".
As an alternative, I could store XML in the PropertyValues column...
What do you think about the best approach here?
[ADDED]
Please, keep in mind:
Set of properties must be customizable
Objects will have dozens of properties (approximately from 20 to 120). The database will contain thousands of objects.
[ADDED]
Data in the PropertyValues table will be changed very often. Actually, I store configured products. For example, an admin configures that clothes have the attributes "type", "size", "color", "buttons type", "label type", "label location", etc. The user will select values for these attributes in the system. So, the PropertyValues data cannot be effectively cached.
You will hate yourself later if you implement a solution using multi-value attributes (i.e. 4:Val4;5:Val5;6:Val6).
XML is marginally better because there are XQuery functions to help you pull out and parse the values. But the XML type is implemented as a CLR type in SQL Server and it can get extremely slow to work with.
The best solution to this problem is one like you have. Use the sql_variant type for the column if it could be any number of data types. Ideally you'd refactor this into multiple tables / entities so that the data type can be something more concrete.
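A minimal sketch of that shape, assuming Entities and Properties tables already exist (as in your earlier question); sql_variant lets each value keep its underlying type instead of forcing everything into a string:
CREATE TABLE dbo.PropertyValues
(
    EntityId      int         NOT NULL REFERENCES dbo.Entities (EntityId),
    PropertyId    int         NOT NULL REFERENCES dbo.Properties (PropertyId),
    PropertyValue sql_variant NULL,
    CONSTRAINT PK_PropertyValues PRIMARY KEY (EntityId, PropertyId)
);
-- SQL_VARIANT_PROPERTY reveals the stored base type when reading values back
SELECT EntityId, PropertyId,
       SQL_VARIANT_PROPERTY(PropertyValue, 'BaseType') AS StoredType,
       PropertyValue
FROM dbo.PropertyValues;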
I work on a similar project (a web-shop generator). Every product has attributes and every attribute has a set of values; these are separate tables. And for all of this there are translations in several languages (so there are additional tables for attribute and value translations).
Why did we choose such a solution? Because for every client there has to be a database with the same schema, so the schema has to be very elastic.
So what about this solution? As always, "it depends" -))
Storage. If a value will be used often across different products (e.g. clothes, where the attribute "size" and its values are repeated often), your attribute/value tables will be smaller. Meanwhile, if values are mostly unique rather than repeated (e.g. values for the attribute "page count" for books), you will get a fairly big values table, where every value is linked to one product.
Speed. This scheme is not the weakest part of the project, because this data changes rarely. And remember that you can always denormalize the database schema to prepare a DW-like solution. You can also use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
So the answer to your question is not simple. If you need an elastic schema with unknown attributes and values, you should use separate tables. I suggest you forget about storing values in CSV strings; it is better to store them as XML (typed and indexed).
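If you do go the XML route, "typed and indexed" means roughly this (a sketch; untyped xml is shown, and a typed column would additionally need CREATE XML SCHEMA COLLECTION):
-- Add an XML column to hold the property bag
ALTER TABLE dbo.Entities ADD PropertyValues xml NULL;
-- A primary XML index speeds up XQuery against the column
-- (requires the table to have a clustered primary key)
CREATE PRIMARY XML INDEX IX_Entities_PropertyValues
    ON dbo.Entities (PropertyValues);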
UPDATE
I think that PropertyValues will not change often, compared with user orders. But if you are in doubt, you should use denormalized tables or indexed views to speed things up. In any case, changing XML/CSV across a large number of rows will perform poorly, so the "separate table" solution looks good.
The SQL Customer Advisory Team (CAT) has a whitepaper written just for you: Best Practices for Semantic Data Modeling for Performance and Scalability. It goes through the common pitfalls of EAV modeling and recommends how to design a scalable EAV solution.

Storing Preferences/One-to-One Relationships in Database

What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
    [company].CompanyID,
    [company].CompanyName,
    [contacttype].ContactTypeID,
    [contacttype].ContactType,
    [companycontact].Addressing
FROM
    [company]
    INNER JOIN [companycontact] ON [companycontact].CompanyID = [company].CompanyID
    INNER JOIN [contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither the EAV model nor the relational model significantly slows queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type, would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one-to-one now, but something like an email address or phone number has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance, or only a small chance, then it is OK to keep it as one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you need to do future work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will need to be done regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like CompanyPhoneNumber1, CompanyPhoneNumber2, etc., then method 2 is better, as you would separate your entities and just reference a company id. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. Use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc)

SQL Server: One Table with 400 Columns or 40 Tables with 10 Columns?

I am using SQL Server 2005 Express and Visual Studio 2008.
I have a database which has a table with 400 columns. Things were (just about) manageable until I had to perform bi-directional sync between several databases.
I am wondering what the arguments for and against a 400-column table versus a 40-table design are.
The table is not normalised and comprises mainly nvarchar(64) columns and some TEXT columns (there are no specific data types as it was converted from text files).
There is one other table that links to this table in a 1-1 relationship (i.e. one entry relates to one entry in the 400-column table).
The table is a list of files that contain parameters that are "plugged" into an application.
I look forward to your replies.
Thank you
Based on your process description I would start with something like this. The model is simplified, does not capture history, etc. -- but it is a good starting point. Note: parameter = property.
- Setup is a collection of properties. One setup can have many properties, one property belongs to one setup only.
- Machine can have many setups, one setup belongs to one machine only.
- Property is of a specific type (temperature, run time, spindle speed), there can be many properties of a certain type.
- Measurement and trait are types of properties. Measurement is a numeric property, like speed. Trait is a descriptive property, like color or some text.
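A rough translation of those bullet points into tables (the names and types are mine, and history/auditing is omitted):
CREATE TABLE Machine
(
    MachineID   int          NOT NULL PRIMARY KEY,
    MachineName varchar(100) NOT NULL
);
-- One machine has many setups
CREATE TABLE Setup
(
    SetupID   int NOT NULL PRIMARY KEY,
    MachineID int NOT NULL REFERENCES Machine (MachineID)
);
-- Property type: temperature, run time, spindle speed, ...
CREATE TABLE PropertyType
(
    PropertyTypeID int         NOT NULL PRIMARY KEY,
    TypeName       varchar(50) NOT NULL
);
-- One setup is a collection of properties, each of a specific type
CREATE TABLE Property
(
    PropertyID     int NOT NULL PRIMARY KEY,
    SetupID        int NOT NULL REFERENCES Setup (SetupID),
    PropertyTypeID int NOT NULL REFERENCES PropertyType (PropertyTypeID)
);
-- Subtypes: a measurement is numeric, a trait is descriptive text
CREATE TABLE Measurement
(
    PropertyID   int           NOT NULL PRIMARY KEY REFERENCES Property (PropertyID),
    NumericValue decimal(18,4) NOT NULL
);
CREATE TABLE Trait
(
    PropertyID int           NOT NULL PRIMARY KEY REFERENCES Property (PropertyID),
    TextValue  nvarchar(255) NOT NULL
);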
For having a wide table:
- Quick to report on, as it's presumably denormalized and so no joins are needed.
- Easy to understand for end-consumers, as they don't need to hold a data model in their heads.
Against having a wide table:
- You probably need multiple composite indexes to get good query performance.
- It is more difficult to maintain data consistency, i.e. you need to update multiple rows when data changes if that data is on multiple rows.
- As you're having to update multiple rows and maintain multiple indexes, concurrent performance for updates may become an issue as locks escalate.
- You might end up with records with loads of NULLs in columns if the attribute isn't relevant to the entity on that row, which can make handling results awkward.
- If lazy developers do a SELECT * from the table, you end up dragging loads of data across the network, so you generally have to maintain suitable subset views.
So it all really depends on what you're doing. If the main purpose of the table is OLAP reporting and updates are infrequent and affect few rows then perhaps a wide, denormalized table is the right thing to have. In an OLTP environment then it's probably not and you should prefer narrower tables. (I generally design in 3NF and then denormalize for query performance as I go along.)
You could always take the approach of normalizing and providing a wide-view for readers if that's what they want to see.
Without knowing more about the situation it's not really possible to say more about the pros and cons in your particular circumstance.
Edit:
Given what you've said in your comments, have you considered just having a long-and-skinny name=value pair table, so you'd just have UserId, PropertyName, PropertyValue columns? You might want to add some other meta-attributes too: timestamp, version, or whatever. SQL Server is quite efficient at handling these sorts of tables, so don't discount a simple solution like this out of hand.
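A sketch of that long-and-skinny shape (the table name and the extra meta-attributes are just examples of the kind you might add):
CREATE TABLE dbo.UserProperty
(
    UserId        int           NOT NULL,
    PropertyName  nvarchar(64)  NOT NULL,
    PropertyValue nvarchar(max) NULL,
    LastUpdated   datetime      NOT NULL DEFAULT (GETDATE()),
    VersionNo     int           NOT NULL DEFAULT (1),
    CONSTRAINT PK_UserProperty PRIMARY KEY (UserId, PropertyName)
);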
