Ship management database structure discussion (should denormalize?) - database

My software went into production a few days ago, and now I want to discuss the database structure a bit.
The software collects data about ships, currently 174 details for each ship. Each detail can be a text value, a long text value, a number (of a specified length, with or without a specified number of decimals), a date, a date with time, a boolean field, a menu with many values, a list of data, and more.
I solved the problem with the following tables:
Ship:
- ID - smallint, Autoincrement identity
- IMO - int, A number that does not change for the life of the ship
ShipDetailType:
- ID - smallint, Autoincrement identity
- Description - nvarchar(200), The description of the value the field contains
- Position - smallint, The position of the field in the data input form
- ShipDetailGroup_ID - smallint, A key to the group the field belongs to in the data input form
- Type - varchar(4), The type of the field as mentioned above
ShipDetailGroup
- ID - smallint, Autoincrement identity
(snip...)
ShipMenuPresetValue
- ID - smallint, Autoincrement identity
- ShipDetailType_ID - smallint, A key to the detail the value belongs to
- Value - nvarchar(100), The values preset in the menu type detail
ShipTextDetail
- ID - smallint, Autoincrement identity
- Ship_ID - smallint, A Key to the ship the detail belongs to
- ShipDetailType_ID - smallint, a Key to the detail type of the value
- Text - nvarchar(500), the field containing the detail's value
- ModifiedDate - smalldatetime
- User_ID - smallint, A key to the user table
ShipTextDetailHistory
(snip...)
This table is the same as the ShipTextDetail and contains every change to the details.
Other tables for the list detail type, each with the specified fields required for the list, ...
I just read these articles: http://thedailywtf.com/Articles/The_Inner-Platform_Effect.aspx and http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
The articles say that this is not the right way to handle the problem.
My customer has a management GUI for the details and groups, since he changes the detail descriptions and adds more details.
The data input form is built dynamically by reading the structure from the DetailGroups and DetailTypes; each detail type generates a specific input control.
The comments suggest that another way of solving this is to dynamically create and remove columns from the table.
What do you think?
Diagram Screenshot: http://img24.imageshack.us/my.php?image=66604496uk3.png

I would refactor your code if:
- Your customer complained
- You found something that didn't work
- You found a way that the code couldn't handle a change you knew was going to happen in the future
You remembered to write unit tests that will allow you to refactor, right?
As far as the structure you have there, I've seen structures like it before. It's a little cumbersome, but it is standard in many places. One thing to remember is that while it's possible to dynamically add and remove columns from databases, the internal storage mechanism of the database doesn't necessarily expect you to be adding and removing these columns continuously. But I don't think this is very relevant compared to the points above, which boil down to: does it work?

I've seen this approach before, and it presented loads of performance issues once the data volume grew. The kind of problems you'll encounter come when you need to return multiple items and use multiple criteria in your WHERE clause. You join back and forth between Ship and ShipTextDetail to get all your SELECT columns - maybe you have to do that 10-20 times. You then do the same for your criteria, maybe 2-3 times. Now you have a query with so many joins it runs really slowly. Next you 'pre-cook' some of the data to improve performance, i.e. you drag common data out into a fixed table structure - and you've returned to a semi-normalised model.
My recommendation would be this: you already know the information for 174 fields; those are your core attributes. Your customer may add to that list, and may change the descriptions of the fields, but it's still a really good starting point. Create a proper data model based around those, and then build in an extensibility mechanism, as you have already done, but only for the new fields. The metadata - the descriptions of the fields - can reside in another table, or potentially in a resource file (useful for internationalisation?), and that gives some flexibility for existing fields.
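To make that concrete, here is a rough sketch of what the hybrid could look like (any column not named in the question, such as Name or Owner, is an assumption used purely for illustration):

CREATE TABLE Ship
(
    ID smallint IDENTITY(1,1) PRIMARY KEY,
    IMO int NOT NULL,                       -- the number that never changes for the ship
    Name nvarchar(200) NOT NULL,            -- assumed core attribute
    Owner nvarchar(200) NULL,               -- assumed core attribute
    Registration nvarchar(50) NULL          -- assumed core attribute
    -- ... the remaining known attributes as properly typed columns
);

-- Only fields the customer adds after go-live go through the generic mechanism,
-- essentially your current ShipTextDetail restricted to the new fields.
CREATE TABLE ShipExtendedDetail
(
    ID int IDENTITY(1,1) PRIMARY KEY,
    Ship_ID smallint NOT NULL REFERENCES Ship(ID),
    ShipDetailType_ID smallint NOT NULL REFERENCES ShipDetailType(ID),
    Value nvarchar(500) NULL,
    ModifiedDate smalldatetime NOT NULL,
    User_ID smallint NOT NULL
);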
I agree with Joe: you may not have problems if your DB is small, i.e. fewer than 1000 ships, and your selects are simple - although with 174 attributes to choose from this doesn't appear likely. I think you should change some of the 'obvious' fields first, i.e. I'd assume you have a Ship.Name, Ship.Owner, Ship.Weight, Ship.Registration ...
Good Luck.

I've done similar things, but there are a couple problems with this specific implementation:
You are storing numbers, booleans, dates, etc. as strings. This might be less than ideal. An alternative is to implement separate classes (inheriting from a base) for the different data types then store them in tables made for their data type.
Do the properties that you track change very frequently? Are they a different set per tanker? If not, it might be better to make objects rather than property bags to store all the data. Those objects can then be persisted to the database.

From a performance standpoint, either approach will be fine. How many ships could there possibly be? All the data is going to fit into RAM on any server.


DB Normalization and single field break outs

My question has probably been asked many times, but I can't quite find it (nor has googling been very good).
I'm trying to normalize our DB. Here is the example:
Say we currently have a single table:
Property
---------
id
name
type
types can either be:
multi-family
single-family
healthcare
commercial
I could break this into a separate table so that we have:
Property        Prop_Type
--------        ---------
id              prop_id
name            type
type_id
According to 2-n, I should break this up. But what am I actually saving in performance? I agree that breaking up tables like this makes it easier for us to insert new types of real estate, or modify current ones. But assuming that this isn't very necessary, would this result in a performance increase? The field Property.type is holding up to a 32 byte string versus a Property.type_id which is similar (no?). Plus there is an additional table required in the second option, and a join every time we want to access that data. Finally, our DB is not that large (maybe tens of thousands of records), so space saving is not a priority.
Should I continue to normalize or should I hold off on these small individual breaks?
Thank you!
Should I continue to normalize or should I hold off on these small individual breaks?
Normalization to higher normal forms replaces a table by others using the same columns that join back to the original based on functional dependencies and join dependencies.
According to 2-n, I should break this up
Presumably you mean 2NF. You have not given any information to justify that. And what you discuss doing has nothing to do with normalization.
It looks like you understand little about normalization. Get a reference presenting and explaining its issues, definitions and procedures. Use them. Quote them.
But what am I actually saving in performance?
Normalization should be done regardless of performance. You change to another particular design only when that change is justified by its demonstrated present value relative to the ideal/original one.
It's not meaningful to talk about a design's performance without having been given details for a particular DBMS implementation plus expected use. But roughly speaking introducing ids uses less space but causes more joins.
DBMSes exist so that information stored in tables can be queried via algebra and/or conditions as implemented by the DBMS. Just make the most straightforward design. You need to understand far more about schemas and querying before you will know enough to modify a design for performance.
I agree that breaking up tables like this makes it easier for us to insert new types of real estate,
No, it makes it harder. All you used to have to do is enter the type value you wanted in a Property row. With ids you have to add a Prop_Type row and use that type_id in a Property row.
If possible values for Property type are fixed then add a CHECK constraint on Property type:
CHECK(type IN ('multi-family','single-family','healthcare','commercial'))
(Otherwise, don't.)
If you want possible values for properties to be updated and queried without a schema change and there does not have to be a property for every type then that is something that your original design cannot express. But you still don't need to introduce ids; you can have a Prop_Type table with just a type column and a foreign key from Property type to Prop_Type type.
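A minimal sketch of that id-free approach, assuming a varchar type column (the lengths here are guesses, not from the question):

CREATE TABLE Prop_Type
(
    type varchar(32) PRIMARY KEY
);

CREATE TABLE Property
(
    id int PRIMARY KEY,
    name varchar(100),
    type varchar(32) NOT NULL REFERENCES Prop_Type(type)
);

-- New property types can then be added with plain DML, no schema change:
INSERT INTO Prop_Type (type) VALUES ('multi-family'), ('single-family'), ('healthcare'), ('commercial');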
I think that this is not a normalization problem.
The type column is essentially a discrete type, i.e. has a finite set of values - currently multi-family, single-family, healthcare, commercial.
What you want is to control that no invalid value is inserted into the column. Your prop_type table and a foreign key constraint is one solution.
A more suitable solution is to use a CHECK CONSTRAINT on the column:
CREATE TABLE Property
(
    id int PRIMARY KEY,
    name ...,
    type varchar(20) CONSTRAINT typeValues CHECK (type IN ('multi-family', 'single-family', 'healthcare', 'commercial'))
)
Going further there is no need to store the complete type string in every record. You could simply use a single character to encode the type:
CREATE TABLE Property
(
    ...
    type char(1) CONSTRAINT typeValues CHECK (type IN ('M', 'S', 'H', 'C'))
)
When you present the type, e.g. in a GUI, you would need to translate them into user readable text. To enter a value you would use a dropdown in the GUI.

Single Big SQL Server lookup table

I have a SQL Server 2008 database with a snowflake-style schema, so lots of different lookup tables, like Language, Countries, States, Status, etc. All these lookup table have almost identical structures: Two columns, Code and Decode. My project manager would like all of these different tables to be one BIG table, so I would need another column, say CodeCategory, and my primary key columns for this big table would be CodeCategory and Code. The problem is that for any of the tables that have the actual code (say Language Code), I cannot establish a foreign key relationship into this big decode table, as the CodeCategory would not be in the fact table, just the code. And codes by themselves will not be unique (they will be within a CodeCategory), so I cannot make an FK from just the fact table code field into the Big lookup table Code field.
So am I missing something, or is this impossible to do and still be able to do FKs in the related tables? I wish I could do this: have a FK where one of the columns I was matching to in the lookup table would match to a string constant. Like this (I know this is impossible but it gives you an idea what I want to do):
ALTER TABLE [dbo].[Users] WITH CHECK ADD CONSTRAINT [FK_User_AppCodes]
FOREIGN KEY('Language', [LanguageCode])
REFERENCES [dbo].[AppCodes] ([AppCodeCategory], [AppCode])
The above does not work, but if it did I would have the FK I need. Where I have the string 'Language', is there any way in T-SQL to substitute the table name from code instead?
I absolutely need the FKs, so if nothing like this is possible, then I will have to stick with my many little lookup tables. Any assistance would be appreciated.
Brian
It is not impossible to accomplish this, but it is impossible to accomplish this and not hurt the system on several levels.
While a single lookup table (as has been pointed out already) is a truly horrible idea, I will say that this pattern does not require a single field PK or that it be auto-generated. It requires a composite PK comprised of ([AppCodeCategory], [AppCode]) and then BOTH fields need to be present in the fact table that would have a composite FK of both fields back to the PK. Again, this is not an endorsement of this particular end-goal, just a technical note that it is possible to have composite PKs and FKs in other, more appropriate scenarios.
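Purely to illustrate the composite-key mechanics (again, not an endorsement of the single-table design), here is a sketch using the names from the question, with assumed data types:

CREATE TABLE dbo.AppCodes
(
    AppCodeCategory varchar(20) NOT NULL,
    AppCode varchar(20) NOT NULL,
    Decode nvarchar(200) NOT NULL,
    CONSTRAINT PK_AppCodes PRIMARY KEY (AppCodeCategory, AppCode)
);

-- The fact table must then carry BOTH columns; the category can be pinned with a default.
ALTER TABLE dbo.Users
    ADD LanguageCodeCategory varchar(20) NOT NULL
        CONSTRAINT DF_Users_LanguageCodeCategory DEFAULT ('Language');

ALTER TABLE dbo.Users WITH CHECK
    ADD CONSTRAINT FK_User_AppCodes
    FOREIGN KEY (LanguageCodeCategory, LanguageCode)
    REFERENCES dbo.AppCodes (AppCodeCategory, AppCode);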
The main problem with this type of approach to constants is that each constant is truly its own thing: Languages, Countries, States, Statii, etc are all completely separate entities. While the structure of them in the database is the same (as of today), the data within that structure does not represent the same things. You would be locked into a model that either disallows from adding additional lookup fields later (such as ISO codes for Language and Country but not the others, or something related to States that is not applicable to the others), or would require adding NULLable fields with no way to know which Category/ies they applied to (have fun debugging issues related to that and/or explaining to the new person -- who has been there for 2 days and is tasked with writing a new report -- that the 3 digit ISO Country Code does not apply to the "Deleted" status).
This approach also requires that you maintain an arbitrary "Category" field in all related tables. And that is per lookup. So if you have CountryCode, LanguageCode, and StateCode in the fact table, each of those FKs gets a matching CategoryID field, so now that is 6 fields instead of 3. Even if you were able to use TINYINT for CategoryID, if your fact table has even 200 million rows, then those three extra 1 byte fields now take up 600 MB, which adversely affects performance. And let's not forget that backups will take longer and take up more space, but disk is cheap, right? Oh, and if backups take longer, then restores also take longer, right? Oh, but the table has closer to 1 billion rows? Even better ;-).
While this approach looks maybe "cleaner" or "easier" now, it is actually more costly in the long run, especially in terms of wasted developer time, as you (and/or others) in the future try to work around issues related to this poor design choice.
Has anyone even asked your project manager what the intended benefit of this is? If you are going to spend some number of hours making changes to the system, it is reasonable to ask that there be a stated benefit for that time spent. It certainly does not make interacting with the data any easier, and in fact will make it harder, especially if you choose a string for the "Category" instead of a TINYINT or maybe SMALLINT.
If your PM still presses for this change, then it should be required, as part of that project, to also change any enums in the app code accordingly so that they match what is in the database. Since the database is having its values munged together, you can accomplish that in C# (assuming your app code is in C#; if not, translate to whatever is appropriate) by setting the enum values explicitly, with a pattern where the first X digits are the "category" and the remaining Y digits are the "value". For example:
Assume the "Country" category == 1 and the "Language" category == 2; you could do:
enum AppCodes
{
    // Countries
    UnitedStates = 1000001,
    Canada = 1000002,
    SomewhereElse = 1000003,
    // Languages
    EnglishUS = 2000001,
    EnglishUK = 2000002,
    French = 2000003
};
Absurd? Completely. But also analogous to the request of merging all lookup tables into a single table. What's good for the goose is good for the gander, right?
Is this being suggested so you can minimise the number of admin screens you need for CRUD operations on your standing data? I've been here before and decided it was better/safer/easier to build a generic screen which used metadata to decide what table to extract from/write to. It was a bit more work to build but kept the database schema 'correct'.
All the standing data tables had the same basic structure, they were mainly for dropdown population with occasional additional fields for business rule purposes.

Custom Fields for a Form representing an object

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User Object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, often we get a question from a customer to add couple of custom fields typical for their domain. Our solution now is an xml data column where key value pairs are stored. This has been ok so far, but now we'll have to find a more architectural solution.
For instance, now a customer wants a dropdown where they can select the value for their custom field. We could still store the selected value in the xml data column, but where do we store all of those dropdown values?
I know that in SharePoint you can also add custom fields like dropdowns, and I was wondering how best to deal with this. I want to avoid creating custom tables per customer, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a table UserConfiguration where each record has a foreign key to the customer (Channel in our database), then a column FieldName, a column FieldType and a column Values. The Values column should be an xml type column, because for a dropdown we'll need to store multiple values. Also, each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys to xml in my database (I read somewhere that Azure can't handle this all too well). Do you just store the name of the value (and what if the value were to disappear from the xml)?
Any documentation, links on this kind of problems would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.
I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by #Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
- databases are great for structured, typed data
- custom fields are inherently less structured
- therefore, custom fields are more difficult to work with in a database
- some or many of the advantages of using a database are lost
- some queries may be more difficult or impossible
- type safety may be lost (in the database)
- data integrity may no longer be enforced (by the database)
- it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relation Mapping tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solved. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value in your database for custom fields, and then add an extra table to hold the legal values for each entity. Then, the attribute/value of the EAV table is a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
    `attribute` varchar(30),
    `value` varchar(30),
    PRIMARY KEY (`attribute`, `value`)
);
CREATE TABLE `eav` ( -- now the values of attributes are restricted
    `entityid` int,
    `attribute` varchar(30),
    `value` varchar(30),
    PRIMARY KEY (`entityid`, `attribute`),
    FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?
I'm doing something similar for a customer. I've created a JSON FieldType which holds the entire JSON stream of a complex object, plus a string containing the FQTN (fully qualified type name) of my C# model class.
By using custom New-, Edit- and Display-Forms we've ensured that our custom objects are rendered the correct way for the best user experience.
To promote fields from the complex C# model to the SharePoint list, we've built something like Microsoft did in InfoPath. Users are able to select properties or metadata from the complex C# type, which will be automatically promoted to the hosting SharePoint list.
The big advantage of JSON is that it's smaller than XML and easier to work with in the web world (JavaScript...).
When you let the users create the data models, I would recommend looking at a document database or 'NoSQL' store, since that is exactly what you want: to store schemaless data structures.
Also, SharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates, etc.).
That said, in my current project (locked in SharePoint, so Framework 3.5 + SQL Server and all the constraints that follow) we use a somewhat similar structure as below:
Form
- Id

Attribute (or Field)
- Name
- Type (enum): Text, List, Dates, Formulas, etc.
- Hidden (bool)
- Mandatory
- DefaultValue
- Options (for lists)
- Readonly
- Mask (for SSN etc.)
- Length (for text fields)
- Order

Metadata
- FormId
- AttributeId
- Text (the value for everything but dates)
- Date (the value for dates)
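A rough T-SQL sketch of the three entities above, just to make the shape concrete (the column types, and the exact keys, are assumptions, not taken from the project):

CREATE TABLE Form
(
    Id int IDENTITY(1,1) PRIMARY KEY
);

CREATE TABLE Attribute
(
    Id int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(200) NOT NULL,
    Type tinyint NOT NULL,              -- enum: Text, List, Date, Formula, ...
    Hidden bit NOT NULL DEFAULT 0,
    Mandatory bit NOT NULL DEFAULT 0,
    DefaultValue nvarchar(max) NULL,
    Options nvarchar(max) NULL,         -- for lists
    [Readonly] bit NOT NULL DEFAULT 0,
    Mask nvarchar(100) NULL,            -- for SSN etc.
    Length int NULL,                    -- for text fields
    [Order] int NOT NULL
);

CREATE TABLE Metadata
(
    FormId int NOT NULL REFERENCES Form(Id),
    AttributeId int NOT NULL REFERENCES Attribute(Id),
    Text nvarchar(max) NULL,            -- the value for everything but dates
    Date datetime NULL,                 -- the value for dates
    PRIMARY KEY (FormId, AttributeId)
);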
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6) and this would produce something like 000999 for the 999th instance of the combined values for attribute 1 and attribute 2 for a form, this is stored as:
AttributeIncrementFormula
- AttributeId
- Counter
- Token
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
form.Metadata["firstname"].Value
Value above is a read-only property that decides whether we should get the value from Text or Date, and whether some additional transform is required. Note that the database here is merely storage; we hold all the domain complexity in the application.
We also let our customer decide which attribute is the form title for example, so if firstname is the form title, they'll set an in-memory param that spans the entire application to be something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight on a production impl of a similar scenario.
This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) and a foreign key to the associated "type", or you can add a lookup/cache system that assigns an integer to each new value encountered by the system ("novelty" can be checked by comparing a hash of the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).
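As a rough illustration of that second option, a value lookup/cache might look something like this (every table and column name here is hypothetical):

-- Each distinct raw value is stored once; a hash of the payload detects duplicates.
CREATE TABLE UserConfigurationValue
(
    ValueId int IDENTITY(1,1) PRIMARY KEY,
    RawValue nvarchar(max) NOT NULL,        -- e.g. the JSON payload of the entered value
    RawValueHash binary(32) NOT NULL UNIQUE -- e.g. SHA-256 of RawValue
);

-- A user's answer for a given custom field then stores only the integer id.
CREATE TABLE UserFieldValue
(
    UserId int NOT NULL,
    UserConfigurationId int NOT NULL,       -- FK to the custom-field definition
    ValueId int NOT NULL REFERENCES UserConfigurationValue(ValueId),
    PRIMARY KEY (UserId, UserConfigurationId)
);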

indexing pros cons in sql server 2008

I am working on a social networking site. Our team has now decided to store the user profile in a denormalized manner, so our table structure is like this.
Here an attribute means one field of the user profile, e.g. FirstName, LastName, BirthDate, etc.,
and a group means the name of a group of fields, e.g. Personal Details, Academic Info, Achievements, etc.
Attribute/Groups master - it creates a hierarchy of groups and attributes:
Attribute_GroupId bigint
ParentId bigint
Attribute_GroupName nvarchar(1000)
ISAttribute bit
DisplayName nvarchar(1000)
DisplaySequence int
Attribute Control Info - stores which control has to be populated at run time for the attribute, as well as its validation criteria:
Attribute_ControlInfoId bigint
AttributeId bigint
ControlType nvarchar(1000)
DataType nvarchar(1000)
DefaultValue nvarchar(1000)
IsRequired bit
RegulareExpression nvarchar(1000)
And finally, Attribute Values, where values will be stored per user for every attribute:
AttributeId bigint
IsValueOrRefId bit
Value nvarchar(MAX)
ReferenceDataId bigint
UserId bigint
Now they are saying that we'll create an index on the Attribute Values table. There is no primary key there either.
A huge amount of data is going to be stored in this table: if there are 50 million users and 30 attributes, it will hold 1500 million records. In that case, if we create an index on the table, won't INSERT and UPDATE statements become very slow, and won't queries fetching the data for one user also be very slow?
One option I thought of is to store one XML record per user instead of attribute-wise values.
So please, can anybody help me find the best option for this case? How should I store the data?
I cannot hard-code the table here, because the administrator can add new fields at any time, so I need a data structure where I can add any field to the user profile in just 1-2 steps.
Please reply if anybody has a better solution.
You guys need a dba!
This is one of those EAV tables that is going to bite you down the road!
Bill Karwin (his blog) put together a SQL Anti-patterns PPT
Link 1
Link 2
He offers 3 alternate solutions to EAV.
Indexing is the least of your worries...
Check out those articles which highlight just how bad that design choice is, and what potential problems you're getting yourself into if you stick to that design:
Five Simple Database Design Errors You Should Avoid
Joe Celko: Avoiding the EAV of Destruction
Bad CaRMa
It seems to be a fairly common design problem - and it seems like a good idea to programmers to solve it that way, with an attribute/value table - but it's really not a good idea from a database performance point of view.
Also:
Now they are saying that we'll create an index on the Attribute Values table. There is no primary key there either.
As some SQL gurus like to say: "If it doesn't have a primary key, it's not a table".
You definitely need to find a way to get a primary key onto your tables - if you don't have anything that you can use per se, add a column "ID" of type "INT IDENTITY(1,1)" to it and put the primary key on that column. You need a primary key! Database design, first lesson, first five minutes....
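For example, a minimal sketch of that advice applied to the attribute-values table (the table name dbo.AttributeValues and the index columns are assumptions about your schema and query pattern):

-- Add a surrogate key and make it the clustered primary key
ALTER TABLE dbo.AttributeValues
    ADD ID bigint IDENTITY(1,1) NOT NULL;

ALTER TABLE dbo.AttributeValues
    ADD CONSTRAINT PK_AttributeValues PRIMARY KEY CLUSTERED (ID);

-- A nonclustered index to support "all attributes for one user" lookups
CREATE NONCLUSTERED INDEX IX_AttributeValues_User
    ON dbo.AttributeValues (UserId, AttributeId)
    INCLUDE (Value);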
You need to rethink your design and come up with something more clever to store the data you need.

Should I make specification table referenceable?

Since I know there are lots of expert database core designers here, I decided to ask this question on stackoverflow.
I'm developing a website whose main concern is to index every product that is available in the real world, like digital cameras, printers, refrigerators, and so on. As we know, each product has its own specifications. For example, a digital camera has its weight, lens, shutter speed, etc. Each specification has a type; for example, price (I see it as a spec) is a number.
I think the most standard way is to create whatever specs are needed for a specific product, with the proper types, and assign them to the product. So for each separate product a PRICE spec has to be created with the type number set on it.
So here is my question: is it possible to have a single table with all specs in it, so that, for example, PRICE with type number has been created before and the user just needs to search for it in the table and assign it to the product? The problem with this method is that I don't see a good way to prevent the user from creating duplicate entries. He has to be able to find the spec he needs (if it's been added before), and I also want him to know that the spec he finds is actually the one he needs, since there may be specs with the same name but different type and usage. If he doesn't find it, he will create it.
Any ideas?
---------------------------- UPDATE ----------------------------
My question is not about DB flexibility. I think that with the second method users will mess up the specs table! They will create thousands of duplicate entries, and I also think they won't find their proper specs.
I have just finished answering Dynamic Table Generation, which discusses a similar problem. Take a look at the observation pattern. If you replace "observation" with "specification" and "subject" with "product" you may find this model useful -- you will not need the Report and Rep_mm_Obs tables.
My suggested data model based on your requirements:
SPECIFICATIONS table
SPECIFICATION_ID, pk
SPECIFICATION_DESCRIPTION
This allows you to have numerous specifications, without being attached to an item.
ITEM_SPECIFICATION_XREF table
ITEM_ID, pk, fk to ITEMS table
SPECIFICATION_ID, pk, fk to SPECIFICATIONS table
VALUE, pk
Benefits:
Making the primary key to be a composite ensures the set of values will be unique throughout the table. Blessing or curse, an item with a given specification could have values 0.99 and 1.00 - these would be valid.
This setup allows for a specification to be associated with 0+ items.
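A minimal T-SQL sketch of this model (the data types are assumptions, and the ITEMS table is presumed to exist already):

CREATE TABLE SPECIFICATIONS
(
    SPECIFICATION_ID int IDENTITY(1,1) PRIMARY KEY,
    SPECIFICATION_DESCRIPTION nvarchar(200) NOT NULL
);

CREATE TABLE ITEM_SPECIFICATION_XREF
(
    ITEM_ID int NOT NULL REFERENCES ITEMS(ITEM_ID),
    SPECIFICATION_ID int NOT NULL REFERENCES SPECIFICATIONS(SPECIFICATION_ID),
    VALUE nvarchar(100) NOT NULL,
    PRIMARY KEY (ITEM_ID, SPECIFICATION_ID, VALUE)
);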
