Custom Fields for a Form representing an object - database

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, we often get a request from a customer to add a couple of custom fields specific to their domain. Our current solution is an xml data column where key-value pairs are stored. This has been OK so far, but now we have to find a more robust architectural solution.
For instance, a customer now wants a dropdown where they can select the value for their custom field. We could still store the selected value in the xml data column, but where do we store all those dropdown values...
I know that in SharePoint you can also add custom fields like dropdowns, and I was wondering how best to deal with this. I want to avoid creating custom tables for customers, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a table UserConfiguration where each record has a foreign key to the customer (Channel in our database), plus a column FieldName, a column FieldType and a column Values. The Values column would be of xml type, because for a dropdown we need to store multiple values, and each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys into xml in my database (I read somewhere that Azure can't handle this all too well). Do you just store the name of the value (and what if that value were to disappear from the xml)?
Any documentation or links on this kind of problem would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.

I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by #Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
- databases are great for structured, typed data
- custom fields are inherently less structured
- therefore, custom fields are more difficult to work with in a database
- some or many of the advantages of using a database are lost
- some queries may be more difficult or impossible
- type safety may be lost (in the database)
- data integrity may no longer be enforced (by the database)
- it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relational Mapping (ORM) tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solves. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value in your database for custom fields, and then add an extra table to hold the legal values for each entity. Then, the attribute/value of the EAV table is a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`attribute`, `value`)
);
CREATE TABLE `eav` ( -- now the values of attributes are restricted
`entityid` int,
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`entityid`, `attribute`),
FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?
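For instance, one hedged way to bolt on the missing type column, extending the sketch above (the attribute table and its datatype column are hypothetical additions, not part of any finished design):

CREATE TABLE `attribute` ( -- one row per custom field, with a declared type
  `attribute` varchar(30) PRIMARY KEY,
  `datatype` varchar(10) NOT NULL -- e.g. 'text', 'int', 'date'; enforcement stays in the application
);

ALTER TABLE `attribute_value`
  ADD FOREIGN KEY (`attribute`) REFERENCES `attribute`(`attribute`);

The database still stores everything as varchar; the datatype column just tells the application how to validate and interpret the value.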

I'm doing something similar for a customer. I've created a JSON FieldType which holds the entire JSON stream of a complex object, plus a string containing the FQTN (fully qualified type name) of my C# model class.
By using custom New-, Edit- and Display-Forms we ensured that our custom objects are rendered the correct way for the best user experience.
To promote fields from the complex C# model to the SharePoint list, we've built something like what Microsoft did in InfoPath: users are able to select properties or metadata from the complex C# type, which are automatically promoted to the hosting SharePoint list.
The big advantage of JSON is that it's smaller than XML and easier to work with in the web world (JavaScript...).
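To make the storage side concrete, a minimal sketch of such a table (table and column names are my assumption, not the actual implementation):

CREATE TABLE CustomFieldValue (
    ItemId        int           NOT NULL, -- the list item the value belongs to
    FieldName     nvarchar(100) NOT NULL,
    JsonPayload   nvarchar(max) NOT NULL, -- the serialized complex object
    ModelTypeName nvarchar(300) NOT NULL, -- the FQTN used to deserialize back into the C# model
    PRIMARY KEY (ItemId, FieldName)
);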

When you let the users create the data models, I would recommend looking at a document database or 'NoSQL' store, since that's exactly what you want: to store schemaless data structures.
Also, SharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates, etc.).
That said, in my current project (locked in SharePoint, so Framework 3.5 + SQL Server and all the constraints that follow) we use a somewhat similar structure as below:
Form
  Id

Attribute (or Field)
  Name
  Type (enum) Text, List, Dates, Formulas etc
  Hidden (bool)
  Mandatory
  DefaultValue
  Options (for lists)
  Readonly
  Mask (for SSN etc)
  Length (for text fields)
  Order

Metadata
  FormId
  AttributeId
  Text (the value for everything but dates)
  Date (the value for dates)
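A rough CREATE TABLE sketch of that structure (keys and column types are my guesses, not the production schema):

CREATE TABLE Form (
    Id int PRIMARY KEY
);

CREATE TABLE Attribute (
    Id int PRIMARY KEY,
    Name nvarchar(100) NOT NULL,
    Type int NOT NULL,           -- enum: Text, List, Dates, Formulas etc
    Hidden bit NOT NULL,
    Mandatory bit NOT NULL,
    DefaultValue nvarchar(500) NULL,
    Options nvarchar(max) NULL,  -- for lists
    Readonly bit NOT NULL,
    Mask nvarchar(50) NULL,      -- for SSN etc
    Length int NULL,             -- for text fields
    [Order] int NOT NULL         -- bracketed: ORDER is a reserved word
);

CREATE TABLE Metadata (
    FormId int NOT NULL REFERENCES Form(Id),
    AttributeId int NOT NULL REFERENCES Attribute(Id),
    [Text] nvarchar(max) NULL,   -- the value for everything but dates
    [Date] datetime NULL,        -- the value for dates
    PRIMARY KEY (FormId, AttributeId)
);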
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6) would produce something like 000999 for the 999th instance of the combined values of attribute 1 and attribute 2 on a form. This is stored as:

AttributeIncrementFormula
  AttributeId
  Counter
  Token
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
form.Metadata["firstname"].Value
Value above is a readonly property that decides whether we should get the value from Text or Date and if some additional transform is required. Note that the database here is merely a storage, we hold all the domain complexity in the application.
We also let our customer decide which attribute is the form title for example, so if firstname is the form title, they'll set an in-memory param that spans the entire application to be something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight on a production impl of a similar scenario.

This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) plus a foreign key to the associated "type", or you can add a lookup/cache system that assigns an integer to each new value encountered by the system ("novelty" can be detected by checking a hash of the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).
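A sketch of the second approach, with all names hypothetical: the lookup/cache is a table keyed by a hash of the raw value, and field values store only the assigned integer.

CREATE TABLE interned_value (
    value_id int IDENTITY PRIMARY KEY,     -- the integer assigned to each new value
    value_hash binary(32) NOT NULL UNIQUE, -- e.g. SHA-256 of the raw data, for the novelty check
    raw_value nvarchar(max) NOT NULL       -- the raw data, e.g. JSON
);

CREATE TABLE custom_field_value (
    user_id int NOT NULL,
    field_id int NOT NULL,
    value_id int NOT NULL REFERENCES interned_value(value_id),
    PRIMARY KEY (user_id, field_id)
);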

Related

DB Normalization and single field break outs

My question has probably been asked many times, but I can't quite find it (nor has googling been very good).
I'm trying to normalize our DB. Here is the example:
Say we currently have a single table:
Property
---------
id
name
type
types can either be:
multi-family
single-family
healthcare
commercial
I could break this into a separate table so that we have:
Property     Prop_Type
---------    ----------
id           prop_id
name         type
type_id
According to 2-n, I should break this up. But what am I actually saving in performance? I agree that breaking up tables like this makes it easier for us to insert new types of real estate, or modify current ones. But assuming that this isn't very necessary, would this result in a performance increase? The field Property.type is holding up to a 32 byte string versus a Property.type_id which is similar (no?). Plus there is an additional table required in the second option, and a join every time we want to access that data. Finally, our DB is not that large (maybe tens of thousands of records), so space saving is not a priority.
Should I continue to normalize or should I hold off on these small individual breaks?
Thank you!
Should I continue to normalize or should I hold off on these small individual breaks?
Normalization to higher normal forms replaces a table by others using the same columns that join back to the original based on functional dependencies and join dependencies.
According to 2-n, I should break this up
Presumably you mean 2NF. You have not given any information to justify that. And what you discuss doing has nothing to do with normalization.
It looks like you understand little about normalization. Get a reference that presents and explains its issues, definitions and procedures. Use them. Quote them.
But what am I actually saving in performance?
Normalization should be done regardless of performance. You change to another particular design only when justified by the demonstrated present value of moving away from the ideal/original.
It's not meaningful to talk about a design's performance without details of a particular DBMS implementation plus expected use. But roughly speaking, introducing ids uses less space but causes more joins.
DBMSes exist to store information in tables and query it via algebra and/or conditions as implemented by the DBMS. Just make the most straightforward design. You need to understand far more about schemas and querying before you will know enough to modify a design for performance.
I agree that breaking up tables like this makes it easier for us to insert new types of real estate,
No, it makes it harder. All you used to have to do is enter the type value you wanted in a Property row. With ids you have to add a Prop_Type row and use that type_id in a Property row.
If possible values for Property type are fixed then add a CHECK constraint on Property type:
CHECK(type IN ('multi-family','single-family','healthcare','commercial'))
(Otherwise, don't.)
If you want possible values for properties to be updated and queried without a schema change and there does not have to be a property for every type then that is something that your original design cannot express. But you still don't need to introduce ids; you can have a Prop_Type table with just a type column and a foreign key from Property type to Prop_Type type.
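In SQL, that id-free design would look something like this (column sizes are arbitrary, and 'industrial' is just a made-up new type for illustration):

CREATE TABLE Prop_Type (
    type varchar(32) PRIMARY KEY
);

CREATE TABLE Property (
    id int PRIMARY KEY,
    name varchar(100),
    type varchar(32) REFERENCES Prop_Type(type)
);

-- Adding a new kind of real estate is then a plain insert, not a schema change:
INSERT INTO Prop_Type (type) VALUES ('industrial');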
I think that this is not a normalization problem.
The type column is essentially a discrete type, i.e. has a finite set of values - currently multi-family, single-family, healthcare, commercial.
What you want is to control that no invalid value is inserted into the column. Your prop_type table and a foreign key constraint is one solution.
A more suitable solution is to use a CHECK CONSTRAINT on the column:
CREATE TABLE Property
(
id int PRIMARY KEY,
name ...,
type varchar(20) CONSTRAINT typeValues CHECK (type IN ('multi-family', 'single-family', 'healthcare', 'commercial'))
)
Going further there is no need to store the complete type string in every record. You could simply use a single character to encode the type:
CREATE TABLE Property
(
...
type char(1) CONSTRAINT typeValues CHECK (type IN ('M', 'S', 'H', 'C'))
)
When you present the type, e.g. in a GUI, you would need to translate them into user readable text. To enter a value you would use a dropdown in the GUI.
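If you ever need the readable text in ad-hoc SQL rather than in the GUI, a plain CASE expression can do the translation (a sketch against the single-character table above):

SELECT id,
       name,
       CASE type
           WHEN 'M' THEN 'multi-family'
           WHEN 'S' THEN 'single-family'
           WHEN 'H' THEN 'healthcare'
           WHEN 'C' THEN 'commercial'
       END AS type_text
FROM Property;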

Type tables and software hardcoded values

I have this DB model:
(TEXT is actually VARCHAR)
entity_group_type is not modifiable at runtime, but it will be modified several times in the near future by the development team to add more entries.
Now I need to retrieve all entries from entity that are of a given entity_group_type. How should the software handle this kind of query? Should I hardcode the entity_group_type _id/name in the software? If so, why do I even need this table then? And which is better to hardcode, the _id or the name?
Or is this the wrong way to structure my data?
Thanks in advance!
Taking your questions one at a time:
How should you refer to the entity group in the software? Hard-code the id, or the name?
Refer to the entity groups in a way that makes your code the most readable. So, perhaps you use the name, or perhaps a constant that looks like the name but whose value is the id. Using a constant can avoid one join when you are looking up entities by group type, but usually this is not much of a concern.
Why do I even need that table in the DB then? Is this the wrong way to structure my data?
This is a perfectly acceptable way to structure your data. The most correct way depends on what you are doing with the data, but for most applications your structure would be correct. However, you certainly don't need that table in the database--you could instead have a "group_type" field on the entity_group table. Here are the pros and cons:
Advantages of the current structure:
Easy to add fields that describe the entity_group_type. For example, you might want to make some group types only viewable by admin users, or disabled, or whatever. If this is a possibility for the future, it pretty much requires this database structure.
Ability to have your database software enforce referential integrity, meaning the data is kept consistent between the entity_group and entity_group_type tables.
Advantages of adding a group_type field on the entity_group table:
Possibly a simpler representation in your code. For example, if you use an MVC architecture, having the extra table might require another model object in your code. That's usually not a problem, and may have advantages, but sometimes simpler is better.
When you are looking up entities by entity group type, your SQL statements will be slightly simpler, since there will be one less table/join involved.
I think in most cases your current structure comes out ahead, although it does depend of how you use the data. Unless you had a good reason to structure the data differently, I would stick with your current structure.
I think you should definitely store entity_group_type in the database, in its own table, as per your design diagram above. Storing that information in the DB makes it possible to query against that information, which adds flexibility to your design.
Once you have this information in the DB, your question seems to be whether it should be broken out into its own table, or just be stored as column in the entity_type table. I think you should break it out into its own table, with a foreign key constraint from entity_type to entity_group_type.
Having entity_group_type in its own table allows for group types to be preserved even if all of the entity groups of that type are removed from the DB.
You can leverage the foreign key to ensure that the entity_group_type name is spelled consistently. Inserted entity_groups must satisfy the foreign key constraint, and so would have to either reference an existing entity_group_type having a properly spelled name, or insert an entity_group_type which will set the proper spelling for that new entity_group_type.
Use a synthetic key for your entity_group_type table, to reduce painful redundancy in the appearance of the entity_group_type name. This makes the data model more DRY, and allows updating the entity_group_type's name to be a simple update to one table.
As for storing the entity_group_type in your application logic, I would suggest storing the name, and looking up the id for the entity_group_type when it is needed. I think that would make the application codebase somewhat more readable, and I think I've made a compelling argument for why this relevant information should have a representation in the database.
Given that you know the name of your entity_group_type at runtime, the correct way to find all entity_groups with that entity_group_type.name is to use a join:
SELECT entity_group._id, entity_group.name, entity_group_type.name AS groupTypeName
FROM entity_group
JOIN entity_group_type ON entity_group.id_type = entity_group_type._id
WHERE entity_group_type.name = 'someGroupName';
This would result in all entity_group information for the given entity_group_type name.
And yes, this is a good, or at least possible, way to store that kind of information. Just imagine that the entity_group_type eventually gets a new attribute, e.g. disabled. Then it is still easy to find all entity_groups which are (or are not) in disabled types.
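For example, with that hypothetical disabled flag, the lookup barely changes (a sketch only; 'disabled' does not exist in the current schema):

SELECT entity_group._id, entity_group.name
FROM entity_group
JOIN entity_group_type ON entity_group.id_type = entity_group_type._id
WHERE entity_group_type.disabled = 0; -- or = 1 to find groups in disabled types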

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question got long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to you and your best judgment to do what you think is best.
I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Efficient way to store a dynamic questionnaire?

In reference to this question, I am facing almost the same scenario, except that in my case the questions are probably static (they're subject to change from time to time, and I still think adding a column per question is a bad idea; but even if I did add them, how would the answers be specified and retrieved?). The answers, however, come in different types: for example, an answer could be yes/no, list items, free text, list items OR free text ("Other, please specify"), multiple-selectable list items, etc.
What would be an efficient way to implement this?
Shimmy, I have written a four-part article that addresses this issue - see Creating a Dynamic, Data-Driven User Interface. The article looks at how to let a user define what data to store about clients, so it's not an exact examination of your question, but it's pretty close. Namely, my article shows how to let an end user define the type of data to store, which is along the lines of what you want.
The following ER diagram gives the gist of the data model:
Here, DynamicAttributesForClients is the table that indicates what user-created attributes a user wants to track for his clients. In short, each attribute has a DataTypeId value, which indicates whether it's a Boolean attribute, a Text attribute, a Numeric attribute, and so on. In your case, this table would store the questions of the survey.
The DynamicValuesForClients table holds the values stored for a particular client for a particular attribute. In your case, this table would store the answers to the questions of the survey. The actual value is stored in the DynamicValue column, which is of type sql_variant, allowing any type of data - numeric, bit, string, etc. - to be stored there.
My article does not address how to handle multiple-choice questions, where a user may select one option from a preset list of options, but enhancing the data model to allow this is pretty straightforward. You would create a new table named DynamicListOptions with the following columns:
- DynamicListOptionId - a primary key
- DynamicAttributeId - specifies which attribute these options are associated with
- OptionText - the option text
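As a sketch (the column types, and the name of the key referenced in DynamicAttributesForClients, are my assumptions rather than the article's actual schema):

CREATE TABLE DynamicListOptions (
    DynamicListOptionId int IDENTITY PRIMARY KEY,
    DynamicAttributeId int NOT NULL -- assumed key name in DynamicAttributesForClients
        REFERENCES DynamicAttributesForClients (DynamicAttributeId),
    OptionText nvarchar(200) NOT NULL
);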
So if you had an attribute that was a multiple-choice option you'd populate the drop-down list in the user interface with the options returned from the query:
SELECT OptionText
FROM DynamicListOptions
WHERE DynamicAttributeId = ...
Finally, you would store the selected DynamicListOptionId value in the DynamicValuesForClients.DynamicValue column to record the list option they selected (or use NULL if they did not choose an item).
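Recording a selection might then look like this (a hedged sketch; the ClientId column name and the literal values are assumed from the prose above):

-- 3 here stands for the chosen DynamicListOptionId; use NULL if nothing was chosen.
INSERT INTO DynamicValuesForClients (ClientId, DynamicAttributeId, DynamicValue)
VALUES (42, 7, 3);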
Give the article a read through. There is a complete, working demo you can download, which includes the complete database and its model. Also, the four articles that make up the series explore the data model in depth and show how to build a web-based (ASP.NET) user interface for letting users define dynamic attributes, how to display them for data entry, and so forth.
Happy Programming!
This may not fit you exactly, but here's what I've got at my part-time job.
I have a questions table, an answers table, and a survey table. For each new survey I create a survey build (because each survey is unique, but questions and answers are repeated a lot). I then have a respondent table that contains some information about the respondent (it also links back to the survey table; I forgot that in the diagram). I also have a response table that links the respondent and the survey build. This probably isn't the best way, but it's the way that works for me, and it works pretty fast (we're at about 1mill+ rows in the response table and it handles like a dream).
With this model I get reusable questions, reusable answers (a lot of our questions use "Yes" and "No"), and a rather slim response table.
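A minimal sketch of that shape (table and column names are my guess at the described model, since the diagram isn't shown here):

CREATE TABLE Question (QuestionId int PRIMARY KEY, QuestionText nvarchar(500));
CREATE TABLE Answer   (AnswerId   int PRIMARY KEY, AnswerText nvarchar(200)); -- e.g. 'Yes', 'No'
CREATE TABLE Survey   (SurveyId   int PRIMARY KEY, Name nvarchar(200));

-- The 'survey build' links reusable questions and answers into one concrete survey.
CREATE TABLE SurveyBuild (
    SurveyBuildId int PRIMARY KEY,
    SurveyId int REFERENCES Survey(SurveyId),
    QuestionId int REFERENCES Question(QuestionId),
    AnswerId int REFERENCES Answer(AnswerId)
);

CREATE TABLE Respondent (
    RespondentId int PRIMARY KEY,
    SurveyId int REFERENCES Survey(SurveyId)
);

-- A response is just respondent x chosen build row, which keeps the table slim.
CREATE TABLE Response (
    RespondentId int REFERENCES Respondent(RespondentId),
    SurveyBuildId int REFERENCES SurveyBuild(SurveyBuildId),
    PRIMARY KEY (RespondentId, SurveyBuildId)
);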

Ship management database structure discussion (should denormalize?)

My software went in production some days ago and now I want to argue a bit about the database structure.
The software collects data about ships, currently 174 details for each ship. Each detail can be a text value, a long text value, a number (of a specified length, with or without a specified number of decimals), a date, a date with time, a boolean field, a menu with many values, a list of data, and more.
I solved the problem with the following tables:
Ship:
- ID - smallint, Autoincrement identity
- IMO - int, A number that does not change for the life of the ship
ShipDetailType:
- ID - smallint, Autoincrement identity
- Description - nvarchar(200), The description of the value the field contains
- Position - smallint, The position of the field in the data input form
- ShipDetailGroup_ID - smallint, A key to the group the field belongs to in the data input form
- Type - varchar(4), The type of the field as mentioned above
ShipDetailGroup
- ID - smallint, Autoincrement identity
(snip...)
ShipMenuPresetValue
- ID - smallint, Autoincrement identity
- ShipDetailType_ID - smallint, A key to the detail the values belongs to
- Value - nvarchar(100), The values preset in the menu type detail
ShipTextDetail
- ID - smallint, Autoincrement identity
- Ship_ID - smallint, A Key to the ship the detail belongs to
- ShipDetailType_ID - smallint, a Key to the detail type of the value
- Text - nvarchar(500), the field containing the detail's value
- ModifiedDate - smalldatetime
- User_ID - smallint, A key to the user table
ShipTextDetailHistory
(snip...)
This table is the same as the ShipTextDetail and contains every change to the details.
Other tables for the list detail type, each with the specified fields required for the list, ...
I just read this article: http://thedailywtf.com/Articles/The_Inner-Platform_Effect.aspx and http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
The articles say that this is not the right way to handle the problem.
My customer has a management GUI for the details and groups, since he changes the detail descriptions and adds more details.
The data input form is dynamically built by reading the structure from the DetailGroups and DetailTypes; each detail type generates a specific input control.
The comments suggest that another way of solving this matter is dynamically creating and removing columns from the table.
What do you think?
Diagram Screenshot: http://img24.imageshack.us/my.php?image=66604496uk3.png
I would refactor your code if:
- Your customer complained
- You found something that didn't work
- You found a way that the code couldn't handle a change you knew was going to happen in the future
You remembered to write unit tests that will allow you to refactor, right?
As far as the structure you have there, I've seen structures like it before. It's a little cumbersome, but it is standard in many places. One thing to remember is that while it's possible to dynamically add and remove columns from databases, the internal storage mechanism of the database doesn't necessarily expect you to be adding and removing these columns continuously. But I don't think this is very relevant compared to the points above, which boil down to: does it work?
I've seen this approach before, and it's presented loads of performance issues once the data volume has grown. The kind of problems you'll encounter come when you need to return multiple items and use multiple criteria in your WHERE clause. You join back and forth between Ship and ShipTextDetail to get all your select columns - maybe you have to do that 10-20 times? Then you do the same for your criteria, maybe 2-3 times. Now you have a query with so many joins it runs really slowly. Next you 'pre-cook' some of the data to improve performance, i.e. you drag out common data into a fixed table structure - ah, you've returned to a semi-normalised model.
My recommendation would be this: you know the information for 174 fields; those are your core attributes. Your customer may add to that list, and may change the description of the fields, but it's still a really good starting point. Create a proper data model based around those, and then build in an extensibility mechanism, as you have already done, but only for the new fields. The metadata - the descriptions of the fields - can reside in another table, or potentially in a resource file (useful for internationalisation?), and that gives some flexibility for existing fields.
I agree with Joe: you may not have problems if your DB is small, i.e. <1000 ships, and your selects are simple, although with 174 attributes to choose from this doesn't appear likely. I think you should break out some of the 'obvious' fields first; I'd assume you have a Ship.Name, Ship.Owner, Ship.Weight, Ship.Registration ...
Good Luck.
I've done similar things, but there are a couple of problems with this specific implementation:
- You are storing numbers, booleans, dates, etc. as strings. This might be less than ideal. An alternative is to implement separate classes (inheriting from a base) for the different data types and then store them in tables made for their data type; see the sketch after this list.
- Do the properties that you track change very frequently? Are they a different set per tanker? If not, it might be better to make objects rather than property bags to store all the data. Those objects can then be persisted to the database.
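For the first point, a hedged sketch of what a per-type table might look like alongside the existing ShipTextDetail (names are modeled on the question's schema, but this table is my invention):

CREATE TABLE ShipNumberDetail (
    ID smallint IDENTITY PRIMARY KEY,
    Ship_ID smallint NOT NULL,           -- key to the ship the detail belongs to
    ShipDetailType_ID smallint NOT NULL, -- key to the detail type of the value
    Number decimal(18, 6) NOT NULL,      -- typed storage instead of a string column
    ModifiedDate smalldatetime NOT NULL,
    User_ID smallint NOT NULL
);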
From a performance standpoint, either approach will be fine. How many ships could there possibly be? All the data is going to fit into RAM on any server.
