I'm looking for advice on structuring a data table, as in the title, to make it efficient for querying and writing. I store information about an entity that has the usual data types: numbers, short strings, etc. Now I need to store an additional field with a large amount of data (~30 KB), and I'm looking at two options:
add an nvarchar(100000) column to the entity table
create a separate table to store this data and link to it from the entity table
Other factors:
each entity row will have an accompanying large text field
each accompanying text field will have at least 20 KB of data
~20% of queries against the entity table also need the large field; the rest can do without it
~95% of queries look up a single entity
I'm using an O/RM to access the data, so all the columns are pulled in (I could pick and choose, but only by making the code look horrid)
Right now I'm leaning toward a separate table, but it has a downside too: I have to keep in mind a few concerns about keeping the data consistent.
It's hard to make a decision without doing a real benchmark, but that could take a few days of work, so I'm turning to SO for a shortcut.
We recently had this exact problem (though it was an XML column instead of an nvarchar(max)), and the issue is exactly the same.
Our use case was to display a list of records on a web page (the first 6 columns of the table)
and then to store a ton of additional information in the nvarchar(max) column, which was displayed once you selected an individual row.
Originally a single table contained all 7 columns.
TABLE 1
INT ID (PK, identity)
5 other columns
NVARCHAR(MAX)
Once we refactored it to the following, we got a massive performance boost.
TABLE 1
INT ID (PK, identity)
5 other columns
INT FID (FK to TABLE 2)
TABLE 2
INT FID (PK, identity)
NVARCHAR(MAX)
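In T-SQL, the split looks roughly like this (a minimal sketch; the table and column names are illustrative, not the ones we actually used):

CREATE TABLE dbo.Table2
(
    FID     int IDENTITY(1,1) PRIMARY KEY,
    BigText nvarchar(max) NOT NULL
);

CREATE TABLE dbo.Table1
(
    ID  int IDENTITY(1,1) PRIMARY KEY,
    -- 5 other columns go here
    FID int NOT NULL REFERENCES dbo.Table2 (FID)
);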
The reason is that if the nvarchar(max) value is short enough, it is stored "in row", but if it extends beyond the page size it gets stored elsewhere. Depending on (a) the size of the table and of the record set you're querying, and (b) the amount of data in your nvarchar(max), this can cause a pretty dramatic performance drop.
Have a read of this link:
http://msdn.microsoft.com/en-us/library/ms189087.aspx
When a large value type or a large object data type column value is
stored in the data row, the Database Engine does not have to access a
separate page or set of pages to read or write the character or binary
string. This makes reading and writing the in-row strings about as
fast as reading or writing limited size varchar, nvarchar, or
varbinary strings. Similarly, when the values are stored off-row, the
Database Engine incurs an additional page read or write.
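The same article describes the table option that controls this behaviour. If you do keep a single table, a hedged sketch of forcing the large values off-row (the table name is illustrative):

-- Push varchar(max)/nvarchar(max)/varbinary(max)/xml values off-row,
-- leaving only a 16-byte pointer in the data row
EXEC sp_tableoption N'dbo.Table1', 'large value types out of row', 1;

-- Check the current setting
SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE name = 'Table1';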
I'd bite the bullet now and design your tables to store the large nvarchar(max) in a separate table, assuming you don't need the data it contains in every SELECT query.
With regard to your comment about using an ORM: we were also using NHibernate in our situation, and it's relatively easy to configure your mappings to lazy-load the related object on demand.
Well, you could start with the documentation...
add an nvarchar(100000) column to the entity table
Given the documented maximum field size of 8,000 bytes, and thus nvarchar(4000) being the largest explicit length, I am interested to know how you consider this an option.
nvarchar(max), ntext, etc. would be the right thing to do.
And then you should read up on full-text search, which has been in SQL Server for ages. Your ORM likely does not support it, though; technology choices limiting features is what typically happens when people abstract a problem away. Not something I would access through an ORM.
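For reference, a minimal full-text setup looks something like this (a sketch only; it assumes a table dbo.Entity with an nvarchar(max) column LargeText and a unique clustered primary key index named PK_Entity, all of which are hypothetical names):

CREATE FULLTEXT CATALOG EntityFtCatalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.Entity (LargeText)
    KEY INDEX PK_Entity
    WITH CHANGE_TRACKING AUTO;

-- Query with CONTAINS instead of LIKE '%...%'
SELECT ID
FROM dbo.Entity
WHERE CONTAINS(LargeText, N'"some phrase"');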
Please read my previous question first: T-SQL finding of exactly same values in referenced table
The main purpose of this question is to find out whether this approach to storing data is effective.
Maybe it would be better to get rid of the PropertyValues table and use an additional PropertyValues nvarchar(max) column in the Entities table instead. For example, instead of this table:
EntityId  PropertyId  PropertyValue
1         4           Val4
1         5           Val5
1         6           Val6
I could store the same data in the PropertyValues column as "4:Val4;5:Val5;6:Val6".
As an alternative, I could store XML in the PropertyValues column...
What do you think about the best approach here?
[ADDED]
Please, keep in mind:
Set of properties must be customizable
Objects will have dozens of properties (approximately 20 to 120). The database will contain thousands of objects.
[ADDED]
Data in the PropertyValues table will change very often. Actually, I store configured products. For example, an admin configures that clothes have the attributes "type", "size", "color", "buttons type", "label type", "label location", etc. Users then select values for these attributes in the system. So the PropertyValues data cannot be cached effectively.
You will hate yourself later if you implement a solution using multi-valued attributes (i.e. "4:Val4;5:Val5;6:Val6").
XML is marginally better because there are XQuery functions to help you pull out and parse the values. But the XML type is implemented as a CLR type in SQL Server and it can get extremely slow to work with.
The best solution to this problem is the one you already have. Use the sql_variant type for the column if it could hold any number of data types. Ideally you'd refactor this into multiple tables/entities so that the data type can be something more concrete.
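A hedged sketch of that table using sql_variant (keeping the column names from the question):

CREATE TABLE dbo.PropertyValues
(
    EntityId      int         NOT NULL,
    PropertyId    int         NOT NULL,
    PropertyValue sql_variant NULL,
    CONSTRAINT PK_PropertyValues PRIMARY KEY (EntityId, PropertyId)
);

-- SQL_VARIANT_PROPERTY exposes the underlying type when you need it
SELECT EntityId, PropertyId,
       SQL_VARIANT_PROPERTY(PropertyValue, 'BaseType') AS BaseType
FROM dbo.PropertyValues;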
I work on a similar project (a web-shop generator). Every product has attributes, and every attribute has a set of values; these live in separate tables. On top of that there are translations into several languages (so there are additional tables for attribute and value translations).
Why did we choose such a solution? Because for every client there has to be a database with the same schema, and this schema is very elastic.
So what about this solution? As always, "it depends" :-)
Storage. If a value will be used often across different products (e.g. clothes, where the attribute "size" and its values are repeated often), your attribute/value tables will stay small. If, on the other hand, values are mostly unique rather than repeated (e.g. values of a "page count" attribute for books), you will end up with a fairly big values table, where every value is linked to a single product.
Speed. This scheme is not the weakest part of the project, because this data changes rarely. And remember that you can always denormalize the schema into a DW-like solution. You can also use caching if the database part turns out to be slow.
Elasticity. This is the strongest part of the solution. You can easily add/remove attributes and values, and even move values from one attribute to another!
So the answer to your question is not simple. If you need an elastic schema with unknown attributes and values, you should use separate tables. I would advise against storing values in CSV strings; if you do go that way, it is better to store them as XML (typed and indexed).
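If you do try the XML route, the usual shape is something like this (the column, index and XPath names are only illustrative, and a primary XML index requires the table to have a clustered primary key):

ALTER TABLE dbo.Entities ADD PropertyValues xml NULL;

CREATE PRIMARY XML INDEX IX_Entities_PropertyValues
    ON dbo.Entities (PropertyValues);

-- Find entities where property 4 has the value 'Val4',
-- assuming XML shaped like <props><prop id="4" value="Val4"/>...</props>
SELECT EntityId
FROM dbo.Entities
WHERE PropertyValues.exist('/props/prop[@id=4 and @value="Val4"]') = 1;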
UPDATE
I think that PropertyValues will not change often compared with user orders. But if you are in doubt, you can use denormalized tables or indexed views to speed things up. Either way, changing XML/CSV across a large number of rows will perform poorly, so the "separate table" solution looks good.
The SQL Customer Advisory Team (CAT) has a whitepaper written just for you: Best Practices for Semantic Data Modeling for Performance and Scalability. It goes through the common pitfalls of EAV modeling and recommends how to design a scalable EAV solution.
Given the table tblProject. This has a myriad of properties: width, height, etc. Dozens of them.
I'm adding a new module which lets you specify settings for your project for mobile devices. This is a 1-1 relationship, so all the mobile settings should be stored in tblProject. However, the list is getting huge, and there would be some ambiguity among properties (i.e. I would have to prefix all mobile fields with MOBILE so that Mobile_width isn't confused with width).
How bad is it to denormalise and store the mobile settings in another table? Or is there a better way to store the settings? The properties are becoming unwieldy and hard to modify/find in the table.
I want to respond to @Alexander Sobolev's suggestion and provide my own.
@Alexander Sobolev suggests an EAV model. This trades maximum flexibility for poor performance and complexity, since you need to join multiple times to get all the values for an entity. The way you typically work around those issues is to keep all the entity metadata (i.e. tblProperties) in memory so you don't join to it at runtime, and to denormalize the values (i.e. tblProjectProperties) as a CLOB (e.g. XML) hanging off the root table. You then use the values table only for querying and sorting, not to actually retrieve the data. You also usually end up caching the actual entities by ID so you don't pay the cost of deserialization each time. The issues you run into then are cache invalidation of the entities and their metadata. So, overall, a non-trivial approach.
What I would do instead is create a separate table, perhaps more than one depending on your data, with a discriminator/type column:
create table properties (
    root_id int,
    type_id int,
    height int,
    width int
    -- ...etc...
)
Make the unique key a combination of root_id and type_id, where type_id would represent "mobile", for instance - assuming a separate lookup table for the type, as sketched below.
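A hedged sketch of those constraints (the lookup table name and its columns are hypothetical):

-- hypothetical lookup table for the discriminator
create table property_types (
    type_id int primary key,
    type_name varchar(50) not null   -- e.g. 'desktop', 'mobile'
)

alter table properties
    add constraint UQ_properties_root_type unique (root_id, type_id)

alter table properties
    add constraint FK_properties_type
        foreign key (type_id) references property_types (type_id)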
There is nothing bad about storing the mobile section in another table. It could even bring some savings, depending on how much this information is used.
You can store it in another table, or use an even more elaborate version with three tables: your tblProject, a tblProperties, and a tblProjectProperties.
create table tblProperties (
    id int identity(1,1) not null primary key,
    prop_name nvarchar(32),
    prop_description nvarchar(1024)
)

create table tblProjectProperties (
    ProjectUid int not null,
    PropertyUid int not null,
    PropertyValue nvarchar(256),
    foreign key (ProjectUid) references tblProject (uid),
    foreign key (PropertyUid) references tblProperties (id)
)
The thing is, if you have different types of projects which use different properties, you don't need to store all these unused NULLs; you store only the properties you really need for a given project. The above schema gives you some flexibility. You can also create views for different project types and use them to avoid too many joins in user selects, as sketched below.
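For example, a view that flattens a couple of well-known properties for one project type might look like this (the view, property and column names are purely illustrative):

create view vwMobileProjects as
select  p.uid,
        max(case when pr.prop_name = 'width'  then pp.PropertyValue end) as MobileWidth,
        max(case when pr.prop_name = 'height' then pp.PropertyValue end) as MobileHeight
from    tblProject p
join    tblProjectProperties pp on pp.ProjectUid = p.uid
join    tblProperties pr        on pr.id = pp.PropertyUid
group by p.uid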
I've read a lot about star schemas, fact/dimension tables, and SELECT statements for reporting on the data quickly; however, the matter of data entry into a star schema is still unclear to me. How does one "theoretically" enter data into a star-schema database while maintaining the fact table? Is a series of INSERT INTO statements inside a giant stored proc with 20 params my only option (and how do I populate the fact table)?
Many thanks.
Start with dimensions first -- one by one. Use ECCD (Extract, Clean, Conform, Deliver) approach.
Make sure that each dimension has a BusinessKey that uniquely identifies the "object" that a dimension row describes -- like email for a person.
With dimensions loaded, prepare the key-lookup pipeline. In general, for each dimension table you can prepare a key-lookup table (BusinessKey, PrimaryKey). Some designers choose to look up the dimension table directly, but the key-lookup table can often easily be cached in memory, which results in faster fact loading.
Use ECCD for the fact data too. The ECC part happens in the staging area; you can choose (helper) tables or flat files for each step of the ECC, as you prefer.
While delivering fact tables, replace each BusinessKey in the fact row with the matching PrimaryKey that you get from a key-lookup table. Once all BusinessKeys are replaced with their matching PrimaryKeys, insert the row into the fact table.
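In SQL, that last (deliver) step is typically an INSERT ... SELECT that joins the staged rows to the dimensions (or key-lookup tables) on their business keys; all table and column names below are hypothetical:

INSERT INTO FactSales (DateKey, CustomerKey, ProductKey, Quantity, Amount)
SELECT  d.DateKey,
        c.CustomerKey,
        p.ProductKey,
        s.Quantity,
        s.Amount
FROM    StagingSales s
JOIN    DimDate     d ON d.FullDate      = s.SaleDate        -- business key lookup
JOIN    DimCustomer c ON c.CustomerEmail = s.CustomerEmail   -- business key lookup
JOIN    DimProduct  p ON p.ProductCode   = s.ProductCode;    -- business key lookup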
Do not waste your time; use an ETL tool. You can download Pentaho Kettle (Community Edition) for free -- it has everything you need to achieve this.
You typically do not insert data into a star schema the way you might into a normal-form schema, i.e. with a stored procedure that inserts/updates all the appropriate tables within a single transaction. Remember that the star schema is typically a read-only, denormalized model of the data: it is rarely treated transactionally, and it is typically loaded from data that is already denormalized and flat - usually one flat file per star.
As Damir points out, you typically load all the dimensions first (handling slowly changing dimensions, etc.), then load the facts, joining to the appropriate current dimension rows to find the dimension IDs (using the business keys).
I am using SQL Server 2005 Express and Visual Studio 2008.
I have a database which has a table with 400 columns. Things were just about manageable until I had to perform bi-directional sync between several databases.
I am wondering what the arguments for and against a 400-column table versus a 40-table design are.
The table is not normalised and consists mainly of nvarchar(64) columns plus some TEXT columns (there are no proper data types, as it was converted from text files).
There is one other table that links to this table in a 1-1 relationship (i.e. one entry relates to one entry in the 400-column table).
The table is a list of files that contain parameters which are "plugged" into an application.
I look forward to your replies.
Thank you
Based on your process description, I would start with something like this (a rough DDL sketch follows the list). The model is simplified, does not capture history, etc. -- but it is a good starting point. Note: parameter = property.
- Setup is a collection of properties. One setup can have many properties; one property belongs to one setup only.
- A machine can have many setups; one setup belongs to one machine only.
- A property is of a specific type (temperature, run time, spindle speed); there can be many properties of a given type.
- Measurement and trait are kinds of properties. A measurement is a numeric property, like speed. A trait is a descriptive property, like a color or some text.
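All names and types below are only illustrative; measurement and trait are collapsed into two nullable columns for brevity, though they could equally be modelled as subtype tables:

CREATE TABLE Machine (
    MachineId   int IDENTITY(1,1) PRIMARY KEY,
    MachineName nvarchar(64) NOT NULL
);

CREATE TABLE Setup (
    SetupId   int IDENTITY(1,1) PRIMARY KEY,
    MachineId int NOT NULL REFERENCES Machine (MachineId)
);

CREATE TABLE PropertyType (
    PropertyTypeId int IDENTITY(1,1) PRIMARY KEY,
    TypeName       nvarchar(64) NOT NULL  -- temperature, run time, spindle speed...
);

CREATE TABLE Property (
    PropertyId       int IDENTITY(1,1) PRIMARY KEY,
    SetupId          int NOT NULL REFERENCES Setup (SetupId),
    PropertyTypeId   int NOT NULL REFERENCES PropertyType (PropertyTypeId),
    MeasurementValue decimal(18,4) NULL,  -- numeric property (measurement)
    TraitValue       nvarchar(256) NULL   -- descriptive property (trait)
);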
For having a wide table:
Quick to report on as it's presumably denormalized and so no joins are needed.
Easy to understand for end-consumers as they don't need to hold a data model in their heads.
Against having a wide table:
You probably need multiple composite indexes to get good query performance.
It is more difficult to maintain data consistency, i.e. you need to update multiple rows when data changes if that data appears on more than one row.
As you're having to update multiple rows and maintain multiple indexes, concurrent update performance may become an issue as locks escalate.
You might end up with records with loads of NULLs in columns where an attribute isn't relevant to the entity on that row, which can make handling results awkward.
If lazy developers do a SELECT * from the table you end up dragging loads of data across the network, so you generally have to maintain suitable subset views.
So it all really depends on what you're doing. If the main purpose of the table is OLAP reporting and updates are infrequent and affect few rows then perhaps a wide, denormalized table is the right thing to have. In an OLTP environment then it's probably not and you should prefer narrower tables. (I generally design in 3NF and then denormalize for query performance as I go along.)
You could always take the approach of normalizing and providing a wide-view for readers if that's what they want to see.
Without knowing more about the situation it's not really possible to say more about the pros and cons in your particular circumstance.
Edit:
Given what you've said in your comments, have you considered just having a long & skinny name=value pair table, so you'd just have UserId, PropertyName, PropertyValue columns? You might want to add some other meta-attributes too: a timestamp, a version, or whatever. SQL Server is quite efficient at handling these sorts of tables, so don't discount a simple solution like this out of hand.
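A minimal sketch of what that could look like (the table name and the extra meta-attribute are assumptions):

CREATE TABLE UserProperties (
    UserId        int           NOT NULL,
    PropertyName  nvarchar(64)  NOT NULL,
    PropertyValue nvarchar(256) NULL,
    LastUpdated   datetime      NOT NULL DEFAULT GETDATE(),  -- optional meta-attribute
    CONSTRAINT PK_UserProperties PRIMARY KEY (UserId, PropertyName)
);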
An application I've run into at a customer looks like this:
it allows end users to enter "materials".
To those materials, they can append any number of "properties".
Properties can have any value of type decimal, int, datetime, or varchar (with lengths varying from 5 characters to large chunks of text).
Essentially, the Schema looks like this:
Materials
MaterialID int not null PK
MaterialName varchar(100) not null
Properties
PropertyID
PropertyName varchar(100)
MaterialsProperties
MaterialID
PropertyID
PropertyValue varchar(3000)
An essential feature of the application is the search functionality:
end users can search for materials by entering queries like:
[property] inspectionDate > [DateTimeValue]
[property] serialNr = 35465488
Guess how this performs over the MaterialsProperties-table with nearly 2 million records in it.
The database was initially created under SQL Server 2000 and was later migrated to SQL Server 2005.
How can this be done better?
You could consider separating your MaterialsProperties table by type, e.g. into IntMaterialProperties, CharMaterialProperties, etc. This would:
Partition your data.
Allow for potentially faster look-ups for integer (or other numeric) type look-ups.
Potentially reduce storage costs.
You could also introduce a Type column on Properties, which you could use to determine which MaterialProperties table to query. The column could also be used to validate that the user's input is of the correct type, eliminating the need to run a query at all for "bad" input.
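A hedged sketch of the per-type split (column types are carried over from the schema in the question; the names of the new objects are only illustrative):

CREATE TABLE IntMaterialProperties (
    MaterialID    int NOT NULL,
    PropertyID    int NOT NULL,
    PropertyValue int NOT NULL,
    CONSTRAINT PK_IntMaterialProperties PRIMARY KEY (MaterialID, PropertyID)
);

CREATE TABLE CharMaterialProperties (
    MaterialID    int NOT NULL,
    PropertyID    int NOT NULL,
    PropertyValue varchar(3000) NOT NULL,
    CONSTRAINT PK_CharMaterialProperties PRIMARY KEY (MaterialID, PropertyID)
);

-- Properties.Type tells the application (and validation) which table to use
ALTER TABLE Properties ADD [Type] varchar(20) NOT NULL DEFAULT 'varchar';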
Since users can enter their own property names, I guess every query is going to involve a scan of the Properties table (in your example, you need to find the PropertyID of [inspectionDate]). If the Properties table is large, your join will also take a long time. You could try to optimize by denormalizing and storing the property name alongside the PropertyID, i.e. a denormalized name column in the MaterialsProperties table.
You could try adding a property type (int, char, etc.) to the MaterialsProperties table and partitioning the table on that type.
Look at Object-Relational Mapping / Entity-Attribute-Value modelling techniques for query optimization.
Since you already have a lot of data (2 million records), do some data mining and see whether there are repeating groups of properties shared by many materials. You can then put those into a conventional schema and keep the rest in the EAV table. See here for details: http://portal.acm.org/citation.cfm?id=509015&dl=GUIDE&coll=GUIDE&CFID=49465839&CFTOKEN=33971901