Approach to generic database design


An application I'm dealing with at a customer works like this:
it allows end users to enter "materials".
To those materials, they can attach any number of "properties".
Properties can hold a value of any of these types: decimal, int, dateTime, and varchar (with lengths varying from 5 characters to large chunks of text).
Essentially, the schema looks like this:
Materials
MaterialID int not null PK
MaterialName varchar(100) not null
Properties
PropertyID
PropertyName varchar(100)
MaterialsProperties
MaterialID
PropertyID
PropertyValue varchar(3000)
An essential feature of the application is the search functionality:
end users can search materials by entering queries like:
[property] inspectionDate > [DateTimeValue]
[property] serialNr = 35465488
Guess how this performs over the MaterialsProperties table with nearly 2 million records in it.
The database was initially created under SQL Server 2000 and later migrated to SQL Server 2005.
How can this be done better?

You could consider separating your MaterialsProperties table by type, e.g. into IntMaterialProperties, CharMaterialProperties, etc. This would:
Partition your data.
Allow for potentially faster look-ups for integer (or other numeric) type look-ups.
Potentially reduce storage costs.
You could also introduce a Type column to Properties, which you could use to determine which MaterialProperties table to query. The column could also be used to validate the user's input is of the correct type, eliminating the need to query given "bad" input.
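A minimal sketch of that idea, using Python's built-in sqlite3 module so it can be run anywhere rather than SQL Server. The Type column values, the DateMaterialProperties table, the index names, and the search helper are all assumptions made for illustration. The Type column picks the table to query and rejects badly typed input before any query runs.

```python
import sqlite3
from datetime import datetime

# Hypothetical mapping from the Type column to a per-type table and parser.
TABLE_FOR_TYPE = {"int": "IntMaterialProperties", "datetime": "DateMaterialProperties"}
PARSERS = {
    "int": int,
    # ISO 8601 text sorts chronologically, so it can be compared as a string.
    "datetime": lambda s: datetime.fromisoformat(s).isoformat(),
}

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Properties (
            PropertyID INTEGER PRIMARY KEY,
            PropertyName TEXT NOT NULL UNIQUE,
            Type TEXT NOT NULL
        );
        CREATE TABLE IntMaterialProperties (
            MaterialID INTEGER NOT NULL,
            PropertyID INTEGER NOT NULL,
            PropertyValue INTEGER NOT NULL,
            PRIMARY KEY (MaterialID, PropertyID)
        );
        CREATE INDEX ix_int_value
            ON IntMaterialProperties (PropertyID, PropertyValue);
        CREATE TABLE DateMaterialProperties (
            MaterialID INTEGER NOT NULL,
            PropertyID INTEGER NOT NULL,
            PropertyValue TEXT NOT NULL,
            PRIMARY KEY (MaterialID, PropertyID)
        );
        CREATE INDEX ix_date_value
            ON DateMaterialProperties (PropertyID, PropertyValue);
    """)
    return conn

def search(conn, property_name, op, raw_value):
    """Return MaterialIDs whose property satisfies `op raw_value`."""
    if op not in ("=", "<", ">", "<=", ">="):
        raise ValueError("unsupported operator")
    row = conn.execute(
        "SELECT PropertyID, Type FROM Properties WHERE PropertyName = ?",
        (property_name,),
    ).fetchone()
    if row is None:
        return []
    property_id, type_name = row
    value = PARSERS[type_name](raw_value)  # raises ValueError on bad input
    table = TABLE_FOR_TYPE[type_name]      # table name comes from our own dict
    sql = (f"SELECT MaterialID FROM {table} "
           f"WHERE PropertyID = ? AND PropertyValue {op} ? ORDER BY MaterialID")
    return [r[0] for r in conn.execute(sql, (property_id, value))]
```

Because each table stores values in their native type, the (PropertyID, PropertyValue) index can serve range comparisons directly instead of converting varchar values row by row.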

Since users can enter their own property names, I guess every query is going to involve a scan of the Properties table (in your example, I need to find the PropertyID of [inspectionDate]). If the Properties table is large, your join would also take a long time. You could try to optimize by denormalizing: store the property name alongside the PropertyID as a denormalized column in the MaterialsProperties table.
You could try adding a property type (int, char, etc.) to the MaterialsProperties table and partitioning the table on the type.
Look at Object Relational Mapping/Entity Attribute Value Model techniques for query optimization.
Since you already have a lot of data (2 million records), do some data mining and see if there are repeating groups of properties for many materials. You can then put those in one conventional schema and keep the rest in the EAV table. Look here for details: http://portal.acm.org/citation.cfm?id=509015&dl=GUIDE&coll=GUIDE&CFID=49465839&CFTOKEN=33971901

Related

NoSQL DBMS to query and intersect data by sparse properties

I am in a research phase for a project, where the subject is to identify/select objects (e.g. email address or phone number) by querying for any number of sparsely populated properties associated with each of the objects.
First, I was thinking of Cassandra, with something like:
CREATE TABLE data (
property text,
property_value text,
email_id int,
PRIMARY KEY (property, property_value)
) WITH COMPACT STORAGE;
Where it is then easy to retrieve email_id for given property value.
But the need is to query the data by multiple properties and values. I know it is possible to do it client-side by intersecting, but with possibly millions of rows to intersect, it does not seem very efficient to me.
What is the right approach and technology to execute these kinds of queries?

Even if C* has good support for sparse data tables (you can add columns dynamically), it seems to me that your query model doesn't fit well. This could be a good fit for relational databases instead.
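As a rough illustration of why a relational engine fits, the intersection can be pushed into a single query instead of being done client-side. The sketch below uses Python's built-in sqlite3; the table layout mirrors the question's, while the find_ids helper and the three-column primary key are assumptions for the example.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE data (
            property TEXT NOT NULL,
            property_value TEXT NOT NULL,
            email_id INTEGER NOT NULL,
            PRIMARY KEY (property, property_value, email_id)
        )
    """)
    return conn

def find_ids(conn, criteria):
    """Return email_ids that match every property/value pair in `criteria`."""
    if not criteria:
        return []
    pairs = list(criteria.items())  # dict keys are unique, one pair per property
    where = " OR ".join("(property = ? AND property_value = ?)" for _ in pairs)
    params = [x for pair in pairs for x in pair]
    sql = f"""
        SELECT email_id
        FROM data
        WHERE {where}
        GROUP BY email_id
        HAVING COUNT(DISTINCT property) = ?
        ORDER BY email_id
    """
    return [r[0] for r in conn.execute(sql, params + [len(pairs)])]
```

The HAVING clause keeps only the ids that matched all requested properties, so the database performs the intersection and only the final ids travel to the client.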

Structuring database table with large text field

I'm looking for advice on structuring a data table, as in the title, to make it efficient for querying and writing. I store information about an entity which has the usual data types: numbers, short strings, etc. Now I need to store an additional field with a large amount of data (~30 KB), and I'm looking at two options:
add a column an nvarchar(100000) in the entity table
create separate table to store such data and link from the entity table
other factors:
each entity row will have an accompanying large text field
each accompanying text field will have at least 20 KB of data
~20% of queries against entity table also need the large field. Other queries can do without it
~95% of queries seek for single entity
I'm using an O/RM to access the data, so all the columns are pulled in (I could pick and choose by making the code look horrid)
Right now I'm leaning toward having a separate table, but the downside is that I have to take care of keeping the data in the two tables consistent.
It's hard to make a decision without doing a real benchmark, but that could require a few days of work, so I'm turning to SO for a shortcut.

We recently had this exact problem (though it was an XML column instead of an nvarchar(max), the problem is exactly the same).
Our use case was to display a list of records on a web page (the first 6 columns of the table)
and to store a tonne of additional information in the nvarchar(max) column, which was displayed once you selected an individual row.
Originally a single table contained all 7 columns.
TABLE 1
INT ID (PK IDentity)
5 other columns
NVARCHAR(max)
Once we refactored it to the following we got a massive perf. boost.
TABLE 1
INT ID (PK IDentity)
5 other columns
INT FID (FK -TABLE2)
TABLE 2
FID (PK IDENTITY)
nvarchar(max)
The reason is that if the nvarchar(max) value is short enough, it is stored "in-row", but if it extends beyond the page size, it gets stored elsewhere. Depending on a) the size of the table and the record set you're querying, and b) the amount of data in your nvarchar(max), this can cause a pretty dramatic perf. drop.
Have a read of this link:
http://msdn.microsoft.com/en-us/library/ms189087.aspx
When a large value type or a large object data type column value is
stored in the data row, the Database Engine does not have to access a
separate page or set of pages to read or write the character or binary
string. This makes reading and writing the in-row strings about as
fast as reading or writing limited size varchar, nvarchar, or
varbinary strings. Similarly, when the values are stored off-row, the
Database Engine incurs an additional page read or write.
I'd bite the bullet now and design your tables to store the large nvarchar(max) in a separate table, assuming you don't need the data it contains in every select query.
With regard to your comment about using an ORM: we were also using NHibernate in our situation. It's relatively easy to configure your mappings to lazy-load the related object on demand.
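To make the split concrete without an ORM, here is a small sketch using Python's built-in sqlite3. The table names Entity and EntityText and the helper functions are made up for illustration, and SQLite's storage differs from SQL Server's, so this only shows the access pattern: list queries never touch the large column, and the text is fetched on demand for a single entity.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE EntityText (
            FID INTEGER PRIMARY KEY,
            Body TEXT NOT NULL
        );
        CREATE TABLE Entity (
            ID INTEGER PRIMARY KEY,
            Name TEXT NOT NULL,
            FID INTEGER NOT NULL REFERENCES EntityText (FID)
        );
    """)
    return conn

def add_entity(conn, name, body):
    """Insert the large text and its entity row in one transaction."""
    with conn:
        fid = conn.execute(
            "INSERT INTO EntityText (Body) VALUES (?)", (body,)
        ).lastrowid
        return conn.execute(
            "INSERT INTO Entity (Name, FID) VALUES (?, ?)", (name, fid)
        ).lastrowid

def list_entities(conn):
    """The common query: only the narrow table is read."""
    return conn.execute("SELECT ID, Name FROM Entity ORDER BY ID").fetchall()

def load_body(conn, entity_id):
    """The less common query: fetch the large text for a single entity."""
    row = conn.execute(
        "SELECT t.Body FROM Entity e JOIN EntityText t ON t.FID = e.FID "
        "WHERE e.ID = ?",
        (entity_id,),
    ).fetchone()
    return None if row is None else row[0]
```

Writing both rows inside one transaction is one way to address the consistency concern raised in the question; a lazy-loaded ORM association would issue the equivalent of load_body the first time the property is accessed.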

Well, you could start with documentation...
add a column an nvarchar(100000) in the entity table
Given the documented maximum size of 8000 bytes for a field, and thus nvarchar(4000) being the maximum, I am interested to know how you consider this an option.
nvarchar(max) would be the right thing to do (ntext also holds large text, but it is deprecated in favour of nvarchar(max)).
And then you should read up on full-text search, which has been in SQL Server pretty much for ages. Your ORM likely does not support it, though; technology choices limiting features is typical when people abstract a problem away. It is not something I would access with an ORM.

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has its account settings, privacy settings and lots of other properties to set. The number of those properties started to grow and I could end up with 30 properties or so.
Till now, I used to keep it in "UserInfo" table having User and UserInfo related as One-To-Many (keeping a log of all changes). Putting it in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings and other "groups" of settings in separate tables and have 1-1 relations between UserInfo and each group of settings table is one solution, but would that be too slow (or much slower) when retrieving the data? I guess all data would not be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping log of each group separately)?

If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns every day may become time-consuming as you get lots of rows.
For an alternative solution check out this blog for a nifty solution for storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
user_id SERIAL PRIMARY KEY,
user_properties TEXT
);
This makes it hard to search for a given value in a given property. So in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: the value of the given property, and a foreign key back to the row in the main table where that value is found. Now you can index the value column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
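One way to keep that chore manageable is to route every write through a single function that updates the blob and the auxiliary table in the same transaction. The sketch below uses Python's built-in sqlite3 and JSON as the serialization format; the save_user and users_born_on helpers are hypothetical names, not something from the FriendFeed article.

```python
import json
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Users (
            user_id INTEGER PRIMARY KEY,
            user_properties TEXT NOT NULL
        );
        CREATE TABLE UserBirthdate (
            user_id INTEGER PRIMARY KEY REFERENCES Users (user_id),
            birthdate TEXT NOT NULL
        );
        CREATE INDEX ix_birthdate ON UserBirthdate (birthdate);
    """)
    return conn

def save_user(conn, user_id, properties):
    """Write the JSON blob and refresh its searchable copy atomically."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM UserBirthdate WHERE user_id = ?", (user_id,))
        conn.execute(
            "INSERT OR REPLACE INTO Users (user_id, user_properties) "
            "VALUES (?, ?)",
            (user_id, json.dumps(properties)),
        )
        if "birthdate" in properties:
            conn.execute(
                "INSERT INTO UserBirthdate (user_id, birthdate) VALUES (?, ?)",
                (user_id, properties["birthdate"]),
            )

def users_born_on(conn, birthdate):
    """Look up users through the indexed auxiliary table."""
    rows = conn.execute(
        "SELECT u.user_id, u.user_properties FROM Users u "
        "JOIN UserBirthdate b ON b.user_id = u.user_id "
        "WHERE b.birthdate = ? ORDER BY u.user_id",
        (birthdate,),
    )
    return [(uid, json.loads(props)) for uid, props in rows]
```

Each additional searchable property would add one more delete-and-insert pair inside save_user, which is where the complexity mentioned above accumulates.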

Database design, huge number of parameters, denormalise?

Take the table tblProject. It has a myriad of properties, for example width, height, etc. Dozens of them.
I'm adding a new module which lets you specify settings for your project for mobile devices. This is a 1-1 relationship, so all the mobile settings should be stored in tblProject. However, the list is getting huge, and there will be some ambiguity among properties (i.e., I will have to prefix all mobile fields with MOBILE so that Mobile_width isn't confused with width).
How bad is it to denormalise and store the mobile settings in another table? Or is there a better way to store the settings? The properties are becoming unwieldy and hard to find or modify in the table.

I want to respond to @Alexander Sobolev's suggestion and provide my own.
@Alexander Sobolev suggests an EAV model. This trades maximum flexibility for poor performance and complexity, as you need to join multiple times to get all values for an entity. The way you typically work around those issues is to keep all the entity metadata in memory (i.e. tblProperties) so you don't join to it at runtime, and to denormalize the values (i.e. tblProjectProperties) as a CLOB (e.g. XML) off the root table. Thus you only use the values table for querying and sorting, not for actually retrieving the data. You also usually end up caching the actual entities by ID so you don't pay for deserialization each time. The issues you then run into are cache invalidation of the entities and their metadata. So overall, it is a non-trivial approach.
What I would do instead is create a separate table, perhaps more than one depending on your data, with a discriminator/type column:
create table properties (
root_id int,
type_id int,
height int,
width int
-- ...etc...
)
Make the unique key a combination of root_id and type_id, where type_id would represent, for instance, mobile (my example assumes a separate lookup table for the types).

There is nothing wrong with storing the mobile section in another table. It could even save some resources, depending on how much this information is used.
You can store it in one other table, or use an even more elaborate version with three tables: your tblProject, tblProperties, and tblProjectProperties.
create table tblProperties (
id int identity(1,1) not null,
prop_name nvarchar(32),
prop_description nvarchar(1024)
)
create table tblProjectProperties
(
ProjectUid int not null,
PropertyUid int not null,
PropertyValue nvarchar(256)
)
with foreign key tblProjectProperties.ProjectUid -> tblProject.uid
and foreign key tblProjectProperties.PropertyUid -> tblProperties.id
The point is that if you have different types of projects which use different properties, you don't need to store all the unused nulls; you store only the properties you really need for a given project. The schema above gives you some flexibility. You can create views for the different project types and use them to avoid too many joins in user selects.
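A sketch of such a view, run through Python's built-in sqlite3 so it is self-contained. The view name vwMobileSettings and the property names mobile_width and mobile_height are invented for the example. The view pivots the property rows back into one row per project using conditional aggregation.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tblProperties (
            id INTEGER PRIMARY KEY,
            prop_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE tblProjectProperties (
            ProjectUid INTEGER NOT NULL,
            PropertyUid INTEGER NOT NULL REFERENCES tblProperties (id),
            PropertyValue TEXT,
            PRIMARY KEY (ProjectUid, PropertyUid)
        );
        -- One row per project; each CASE picks out a single property.
        CREATE VIEW vwMobileSettings AS
        SELECT pp.ProjectUid,
               MAX(CASE WHEN p.prop_name = 'mobile_width'
                        THEN pp.PropertyValue END) AS mobile_width,
               MAX(CASE WHEN p.prop_name = 'mobile_height'
                        THEN pp.PropertyValue END) AS mobile_height
        FROM tblProjectProperties pp
        JOIN tblProperties p ON p.id = pp.PropertyUid
        WHERE p.prop_name IN ('mobile_width', 'mobile_height')
        GROUP BY pp.ProjectUid;
    """)
    return conn

def mobile_settings(conn, project_uid):
    """Read a project's mobile settings as if they were ordinary columns."""
    return conn.execute(
        "SELECT mobile_width, mobile_height FROM vwMobileSettings "
        "WHERE ProjectUid = ?",
        (project_uid,),
    ).fetchone()
```

Queries against the view read like queries against a wide table, while properties a project does not use simply have no row and surface as NULL.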

What is the best way to keep changes history to database fields?

For example I have a table which stores details about properties. Which could have owners, value etc.
Is there a good design to keep the history of every change to owner and value. I want to do this for many tables. Kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.

There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other meta data such as field_type, user_id, user_ip, action (update, delete, insert) etc.. can be useful.
The structure of such records will most likely need to be transformed to be used.
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database create a generalized table that has all the fields as the original record, plus a versioning field (additional meta data again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or have no indexes at all and no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, but of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)

A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to turn UPDATE activity into INSERT activity, and to turn DELETE activity into an update of the IS_DELETED flag.
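To show how such a table is read, here is a small sketch with Python's built-in sqlite3 instead of Oracle. The add_version and as_of helpers are hypothetical, and one simplification is an assumption of this sketch: a delete is recorded as one more version row with is_deleted = 1 rather than as an update of the flag on an existing row.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE my_table (
            my_table_id INTEGER NOT NULL,
            attr1 TEXT NOT NULL,
            attr2 INTEGER,
            effective_date TEXT NOT NULL,   -- ISO date, sorts chronologically
            is_deleted INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (my_table_id, effective_date)
        )
    """)
    return conn

def add_version(conn, my_table_id, attr1, attr2, effective_date, is_deleted=0):
    """An 'update' becomes an insert of a new version row."""
    with conn:
        conn.execute(
            "INSERT INTO my_table VALUES (?, ?, ?, ?, ?)",
            (my_table_id, attr1, attr2, effective_date, is_deleted),
        )

def as_of(conn, my_table_id, date):
    """Return (attr1, attr2) as the row looked on `date`, or None."""
    row = conn.execute(
        "SELECT attr1, attr2, is_deleted FROM my_table "
        "WHERE my_table_id = ? AND effective_date <= ? "
        "ORDER BY effective_date DESC LIMIT 1",
        (my_table_id, date),
    ).fetchone()
    if row is None or row[2]:
        return None  # not yet created, or deleted by that date
    return row[:2]
```

The current state is simply as_of with today's date, so the same query shape serves both live reads and historical ones.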
Unreason:
You are correct that this solution is similar to record-based auditing; I initially read it as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, I appreciate such clarity, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers, exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.

In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required).
RecordID - points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that, you have the latest values in properties and all the history (previous values) in properties_audit.
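A minimal sketch of the trigger variant, using Python's built-in sqlite3. The trigger name and the changed_at column are assumptions for the example, and trigger syntax differs between database engines. On every update, the trigger copies the old values into properties_audit.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE properties (
            ID INTEGER PRIMARY KEY,
            value1 TEXT,
            value2 TEXT
        );
        CREATE TABLE properties_audit (
            ID INTEGER PRIMARY KEY,          -- id of the history record
            RecordID INTEGER NOT NULL,       -- points at properties.ID
            changed_at TEXT NOT NULL,
            value1 TEXT,
            value2 TEXT
        );
        -- Copy the values being overwritten into the audit table.
        CREATE TRIGGER trg_properties_audit
        AFTER UPDATE ON properties
        BEGIN
            INSERT INTO properties_audit (RecordID, changed_at, value1, value2)
            VALUES (OLD.ID, CURRENT_TIMESTAMP, OLD.value1, OLD.value2);
        END;
    """)
    return conn

def history(conn, record_id):
    """Previous values of a record, oldest first."""
    return conn.execute(
        "SELECT value1, value2 FROM properties_audit "
        "WHERE RecordID = ? ORDER BY ID",
        (record_id,),
    ).fetchall()
```

With the trigger in place, application code keeps issuing plain UPDATE statements and the history accumulates without any change to the DAL.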

I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save current and previous values in the audit tables. When you make a change to any of the fields you just have to add a row in the audit table with the changed value. This way you can always sort the audit table on time and know what was the previous value in the field prior to your change.
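A short sketch of how the previous value falls out of the ordering, using Python's built-in sqlite3. The audit_id column, the record_id column, and both helper functions are assumptions added for the example; record_id is there because the audit row needs some way to say which record changed. Each change is paired with the one logged before it.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE audit (
            audit_id INTEGER PRIMARY KEY,   -- tie-breaker for equal times
            table_name TEXT NOT NULL,
            record_id INTEGER NOT NULL,
            field_name TEXT NOT NULL,
            value TEXT,
            time TEXT NOT NULL,
            userId INTEGER NOT NULL
        )
    """)
    return conn

def log_change(conn, table_name, record_id, field_name, value, time, user_id):
    """Record only the new value of a changed field."""
    with conn:
        conn.execute(
            "INSERT INTO audit (table_name, record_id, field_name, value, "
            "time, userId) VALUES (?, ?, ?, ?, ?, ?)",
            (table_name, record_id, field_name, value, time, user_id),
        )

def changes(conn, table_name, record_id, field_name):
    """Return (previous_value, new_value) pairs, oldest change first."""
    rows = conn.execute(
        "SELECT value FROM audit "
        "WHERE table_name = ? AND record_id = ? AND field_name = ? "
        "ORDER BY time, audit_id",
        (table_name, record_id, field_name),
    )
    values = [r[0] for r in rows]
    # The previous value of each change is the value logged just before it.
    return list(zip([None] + values[:-1], values))
```

This only reconstructs the full history if the initial value is logged as well; otherwise the first change has no known predecessor.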