Dynamic database structure - database

I would like some database/programming suggestion on a specific issue.
I have 5 different people (that live in different parts of the world) that provide me with data. This data is given to me in a variety of ways, following a standard structure layout. However, it's not always harmonized: the data might have extra things that are not in the standard, so I'd like the structure to be as dynamic as possible to accommodate what each person wants to use.
These 5 data sources are then placed inside a central database I host. So basically I have 5 data sources that are formatted following a standard structure, and they are uploaded to my local database.
I want to automate the upload of this data as much as possible for the person providing the data, so I want them to upload new sets of data that are automatically inserted in my local db.
My questions are:
How should I keep the structure dynamic without having to revisit my standard layout to accommodate new fields of data, or different structure?
How do I make them upload data in a way that is incremental? For example, they might be uploading an XML version of their data; my upload code should figure out what already exists.
My final and most important question. Are there better ways of going about this instead of having an upload infrastructure?

How should I keep the structure dynamic without having to revisit my standard layout to accommodate new fields of data, or different structure?
Basically, you pivot the normal database idea of columns and rows.
You have a data name table, which consists of the unique names of the fields of data, and an indicator to tell the import process what type of data is stored, like a date, timestamp, or integer.
You have a data table, which contains the data name id, a sequence number, the data field, and a foreign key to identifying information.
The sequence number is used to differentiate between different values of the same data name.
The data field holds every type of data possible, so it is stored as text, e.g. VARCHAR(MAX) in SQL Server or a large VARCHAR/TEXT column in other databases. It's up to the upload process to convert dates and numbers to strings.
Your data table will have a foreign key to the rest of the information that identifies who the data field belongs to.
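A rough sketch of that layout in SQL (the table and column names here are illustrative, not prescribed by the answer):
CREATE TABLE data_name (
    data_name_id  INT PRIMARY KEY,
    name          VARCHAR(100) NOT NULL UNIQUE,
    data_type     VARCHAR(20) NOT NULL      -- e.g. 'date', 'timestamp', 'integer', 'string'
);

CREATE TABLE data_value (
    record_id     INT NOT NULL,             -- FK to whatever identifies who the data belongs to
    data_name_id  INT NOT NULL REFERENCES data_name (data_name_id),
    sequence_no   INT NOT NULL,             -- differentiates repeated values of the same data name
    value_text    VARCHAR(4000),            -- everything stored as text; the upload converts dates/numbers
    PRIMARY KEY (record_id, data_name_id, sequence_no)
);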
How do I make them upload data in a way that is incremental? For example, they might be uploading an XML version of their data; my upload code should figure out what already exists.
The short answer is that you can't.
Your upload process has to identify duplicate data and not store it on the database.
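As a sketch only (the staging table and key columns below are assumptions, not from the question), the upload could insert just the rows that are not already present:
INSERT INTO data_value (record_id, data_name_id, sequence_no, value_text)
SELECT s.record_id, s.data_name_id, s.sequence_no, s.value_text
FROM staging_upload s
WHERE NOT EXISTS (
    SELECT 1
    FROM data_value d
    WHERE d.record_id = s.record_id
      AND d.data_name_id = s.data_name_id
      AND d.sequence_no = s.sequence_no
);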
My final and most important question. Are there better ways of going about this instead of having an upload infrastructure?
This is a hard question to answer without knowing more about the type of data you're receiving, but there is software that allows you to load databases without a lot of programming, by defining the input data structure and mapping that structure to your database tables.

This is a very general question, but I think I have a general answer. What I think solves your problem is a relational design in which the properties attached to the master record are not pre-determined. Here is an example involving a phone book application.
Common method using a flat, non-normalized table:
Table PERSON has columns Name, HomePhone, OfficePhone.
All well and good, but what do you do if the occasional person shows up with a mobile phone, more than one mobile phone, a fax phone, etc.?
Instead what you do is:
Table Person has columns Person_ID, Name.
Table Phones has columns Person_ID, Phone_Type, PhoneNumber.
There is a one-to-many relationship between Person and Phones, and there can be any number of them from zero to a zillion. The tables are JOINed by Person_ID. You have to have business and presentation logic that enumerates the Phone_Type column (or just let it be free-form, which is not as useful but easier).
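A minimal sketch of those two tables and the join (the WHERE value is just an example):
CREATE TABLE Person (
    Person_ID  INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);

CREATE TABLE Phones (
    Person_ID   INT NOT NULL REFERENCES Person (Person_ID),
    Phone_Type  VARCHAR(20) NOT NULL,   -- 'home', 'office', 'mobile', 'fax', ...
    PhoneNumber VARCHAR(30) NOT NULL
);

-- All phone numbers, of any type, for one person:
SELECT p.Name, ph.Phone_Type, ph.PhoneNumber
FROM Person p
JOIN Phones ph ON ph.Person_ID = p.Person_ID
WHERE p.Person_ID = 1;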
You can do that for any property, and that is what relational databases are all about. I hope this helps.

As others have said, EAV tables can handle dynamic structure (but be aware of performance issues on large tables).
But is it in your interest to have your database fields dictated by the client? You can't write business logic to act upon those new fields because they don't exist yet; they could be anything.
Can you force the client to conform to your model? This allows you to know the fields ahead of time and have business logic act upon the fields. It allows you to write meaningful reports as well, rather than just pivoted data dumps.


Database, Using json field instead of ManyToManyField?

Suppose reviews can have zero or more tags.
One could implement this using three tables, Review/Tag/ReviewTagRelation.
ReviewTagRelation would have foreign keys to the Review and Tag tables.
Or one could use two tables, Review/Tag, where Review has a JSON field to hold the list of tag ids.
The traditional approach seems to be the one using three tables.
I wonder if it is ok to use the two tables approach when there's no need to reference reviews from tags.
i.e. I only need to know what tags are associated with a given review.
In my experience it is always best to keep the data in your database normalized, unless there is a clear-cut reason for not doing so that makes sense for your business requirements.
With normalized data, you know that no matter what, you will always be able to write a query to retrieve exactly what you are looking for, and if for some reason you want to return data as JSON, you can do so in your select query.
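For reference, a minimal sketch of the three-table (normalized) version and the query that answers "which tags does this review have" (column names are illustrative):
CREATE TABLE Review (
    review_id  INT PRIMARY KEY,
    body       VARCHAR(4000)
);

CREATE TABLE Tag (
    tag_id  INT PRIMARY KEY,
    name    VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE ReviewTagRelation (
    review_id  INT NOT NULL REFERENCES Review (review_id),
    tag_id     INT NOT NULL REFERENCES Tag (tag_id),
    PRIMARY KEY (review_id, tag_id)
);

-- Tags associated with one review:
SELECT t.name
FROM ReviewTagRelation rt
JOIN Tag t ON t.tag_id = rt.tag_id
WHERE rt.review_id = 42;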

Database Design - Normalization

This is a new topic to me, I've read a few articles and I'm still unclear even to the point where I'm unsure if the following question relates to the title of this post or not.
My system sends data to a user. The user may elect for the data to be sent by:
XML
Email
Post
Depending on what the user chooses, several additional but different variables are required. Email address, for example, is required for sending the data via email but isn't for sending it via XML.
Assuming we had a database table which stored 'Data Delivery Choices' (XML, Email, or Post), would it be best to store the additionally required variables in this table (meaning if XML was selected, the email field would be empty in that row), or would it be better to make three new tables to store the variables associated with each possible choice in the 'Data Delivery Choices' table and then associate entries in these tables via the 'Data Delivery Choices' PK?
Or doesn't it matter which way it's done?
For the purposes of the question, forget the fact that we may already hold the user's email address etc. elsewhere.
Thanks!
Some RDBMSs (PostgreSQL, for example) allow table inheritance. So you can define DataDeliveryChoices and then create 3 additional tables that inherit from it: DataDeliveryXML, DataDeliveryEmail, ...
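A rough sketch of that PostgreSQL feature (the table and column names are made up for illustration; note that primary key and unique constraints are not inherited by child tables):
CREATE TABLE data_delivery_choice (
    choice_id  INT PRIMARY KEY,
    user_id    INT NOT NULL
);

-- Child tables get the parent's columns plus their own (PostgreSQL INHERITS).
CREATE TABLE data_delivery_email (
    email_address TEXT NOT NULL
) INHERITS (data_delivery_choice);

CREATE TABLE data_delivery_post (
    street TEXT NOT NULL,
    city   TEXT NOT NULL
) INHERITS (data_delivery_choice);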
MSSQL allows you to store (and query) an XML document in a column, so you can store the additional data in XML (not classic database design, but very flexible should you need to add some data fields, without changing the schema).
Your way of adding three additional tables is IMO also an acceptable solution.
The possibilities are endless :)

Custom Fields for a Form representing an object

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User Object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, we often get a request from a customer to add a couple of custom fields typical for their domain. Our solution so far has been an XML data column where key-value pairs are stored. This has been OK, but now we have to find a more structured solution.
For instance, a customer now wants a dropdown where they can select the value for their custom field. We could still store the selected value in the XML data column, but where do we store all those dropdown values...
I know that in SharePoint you can also add custom fields like dropdowns, and I was wondering how to deal with this best. I want to avoid creating custom tables per customer, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a table UserConfiguration where each record has a foreign key to the Customer (Channel in our database), then a column FieldName, a column FieldType and a column Values. The Values column should be an XML-type column, because for a dropdown we'll need to add multiple values. Also, each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys into XML in my database (I read somewhere that Azure can't handle this all too well). Do you just store the name of the value (and what if the value were to disappear out of the XML)?
Any documentation, links on this kind of problems would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.
I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by #Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
databases are great for structured, typed data
custom fields are inherently less structured
therefore, custom fields are more difficult to work with in a database
some or many of the advantages of using a database are lost
some queries may be more difficult or impossible
type safety may be lost (in the database)
data integrity may no longer be enforced (by the database)
it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relational Mapping (ORM) tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solves. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value in your database for custom fields, and then add an extra table to hold the legal values for each entity. Then, the attribute/value of the EAV table is a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`attribute`, `value`)
);
CREATE TABLE `eav` ( -- now the values of attributes are restricted
`entityid` int,
`attribute` varchar(30),
`value` varchar(30),
PRIMARY KEY (`entityid`, `attribute`),
FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?
I'm doing something similar for a customer. I've created a JSON FieldType which holds the entire JSON stream of a complex object and a string containing the FQTN (Fully Qualified Type Name) of my C# model class.
By using custom New-, Edit- and Display-Forms we've ensured that our custom objects are rendered the correct way for the best user experience.
To promote fields from the complex C# model to the SharePoint list, we've built something like Microsoft did in InfoPath. Users are able to select properties or metadata from the complex C# type, which will be automatically promoted to the hosting SharePoint list.
The big advantage of JSON is that it's smaller than XML and easier to work with in the web world. (JavaScript...)
When you let the users create the data models, I would recommend looking at a document database or 'NoSQL' store, since you want exactly that: to store schemaless data structures.
Also, SharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates, etc.)
That said, in my current project (locked in SharePoint, so Framework 3.5 + SQL Server and all the constraints that follow) we use a somewhat similar structure as below:
Form
  Id
Attribute (or Field)
  Name
  Type (enum): Text, List, Dates, Formulas, etc.
  Hidden (bool)
  Mandatory
  DefaultValue
  Options (for lists)
  Readonly
  Mask (for SSN etc.)
  Length (for text fields)
  Order
Metadata
  FormId
  AttributeId
  Text (the value for everything but dates)
  Date (the value for dates)
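A rough relational sketch of the structure above (the column types, and the Attribute Id implied by Metadata.AttributeId, are guesses; the real implementation may differ):
CREATE TABLE Form (
    Id  INT PRIMARY KEY
);

CREATE TABLE Attribute (
    Id            INT PRIMARY KEY,
    Name          VARCHAR(100) NOT NULL,
    Type          INT NOT NULL,          -- enum: text, list, date, formula, ...
    Hidden        BIT NOT NULL,
    Mandatory     BIT NOT NULL,
    DefaultValue  VARCHAR(400),
    Options       VARCHAR(4000),         -- for lists
    Readonly      BIT NOT NULL,
    Mask          VARCHAR(50),           -- for SSN etc.
    Length        INT,                   -- for text fields
    SortOrder     INT                    -- "Order" is a reserved word, so renamed here
);

CREATE TABLE Metadata (
    FormId       INT NOT NULL REFERENCES Form (Id),
    AttributeId  INT NOT NULL REFERENCES Attribute (Id),
    Text         VARCHAR(4000),          -- the value for everything but dates
    Date         DATE,                   -- the value for dates
    PRIMARY KEY (FormId, AttributeId)
);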
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6) would produce something like 000999 for the 999th instance of the combined values of attribute 1 and attribute 2 for a form. This is stored as:
AttributeIncrementFormula
  AttributeId
  Counter
  Token
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
form.Metadata["firstname"].Value
Value above is a read-only property that decides whether we should get the value from Text or Date and whether some additional transform is required. Note that the database here is merely storage; we hold all the domain complexity in the application.
We also let our customer decide which attribute is the form title for example, so if firstname is the form title, they'll set an in-memory param that spans the entire application to be something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight on a production impl of a similar scenario.
This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) and a foreign key to the associated "type", or you can add a lookup/cache system that assigns an integer to each new value that is encountered by the system ("novelty" can be checked by hashing the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).
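Purely as an illustration of that second option (all names below are invented), a hash-keyed value lookup might look like:
CREATE TABLE field_value (
    value_id    INT PRIMARY KEY,
    value_hash  CHAR(64) NOT NULL UNIQUE,   -- e.g. SHA-256 of the raw JSON, used to check novelty
    raw_value   VARCHAR(4000) NOT NULL
);

CREATE TABLE user_field_entry (
    entry_id   INT PRIMARY KEY,
    field_id   INT NOT NULL,                -- FK to the UserConfiguration row describing the field
    value_id   INT NOT NULL REFERENCES field_value (value_id)
);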

Database design query

I'm trying to work out a sensible approach for designing a database where I need to store a wide range of continuously changing information about pets. The categories of data can be broken down into, for example, behaviour, illness, etc. Data will be submitted on a regular basis relating to these categories, so I need to find a good way to design the db to efficiently accommodate this. A simple approach would just be to store multiple records for each pet within each relevant table - e.g. the behaviour table would store the behaviour data and would simply have a timestamp for each record along with the identifier for that pet. When querying the db, it would be straightforward to query the one table with the pet id, using the timestamps to output the correct history of submissions. Is there a more sensible way around this, or does that make sense?
I would use a combination of lookup tables with a strong use of foreign keys. I think what you are suggesting is very common. For example, "get me all the reported illnesses for a particular pet during this date range" would look something like:
SELECT *
FROM table_illness
WHERE table_illness.pet_id = <value>
  AND date BETWEEN table_illness.start_date AND table_illness.finish_date
You could do that for any of the tables. The lookup tables will be a link between, for example, table_illness.illness_type and illness_types.illness_type. The illness_types table is where you would store the details on the types of illnesses.
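The lookup join described there might look something like this (the description column and the exact join column are assumptions for illustration):
SELECT i.start_date, i.finish_date, t.description
FROM table_illness i
JOIN illness_types t ON t.illness_type = i.illness_type
WHERE i.pet_id = <value>;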
When designing a database you should build your tables to mimic real-life objects or concepts. So in that sense the design you suggest makes sense. Each pet should have its own record in a pet table which doesn't change. Changing information should then be placed into the appropriate table which has the pet's id. The time stamp method you suggest is probably what I would do -- unless of course this is for a vet or something. Then I'd create an appointment table with the date and connect the illness or behavior to the appointment as well.

Table "Inheritance" in SQL Server

I am currently in the process of looking at a restructure of our contact management database, and I wanted to hear people's opinions on solving the problem of a number of contact types having shared attributes.
Basically we have 6 contact types which include Person, Company and Position # Company.
In the current structure all of these have an address however in the address table you must store their type in order to join to the contact.
This consistent requirement to join on contact type gets frustrating after a while.
Today I stumbled across a post discussing "Table Inheritance" (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server).
Basically you have a parent table and a number of sub tables (in this case one per contact type). From there you enforce integrity so that a sub table must have a master equivalent where its type is defined.
The way I see it, by this method I would no longer need to store the type in tables like address, as the id is unique across all types.
I just wanted to know if anybody had any feelings on this method, whether it is a good way to go, or perhaps alternatives?
I'm using SQL Server 05 & 08 should that make any difference.
Thanks
Ed
I designed a database just like the link you provided suggests. The case was to store the data for many different technical reports. The number of report types is undefined and will probably grow to about 40 different types.
I created one master report table, that has an autoincrement primary key. That table contains all common information like customer, testsite, equipmentid, date etc.
Then I have one table for each report type that contains the specific information relating to that report type. That table has the same primary key as the master and references the master as well.
My idea for splitting this into different tables with a 1:1 relation (which normally would be a no-no) was to avoid getting one single table with a huge number of columns, which gets very difficult to maintain as you're constantly adding columns.
My design with table inheritance gave me segmented data and expandability without being difficult to maintain. The only thing I had to do was write a special save method to handle writing to two tables automatically. So far I'm very happy with the design and haven't really found any drawbacks, except for a slightly more complicated save method.
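A sketch of that shape, with one invented report type as the sub table (the columns beyond the master's are illustrative, not from the answer):
CREATE TABLE report (                       -- master table with the common columns
    report_id     INT PRIMARY KEY,          -- autoincrement/identity in practice
    customer      VARCHAR(100),
    test_site     VARCHAR(100),
    equipment_id  VARCHAR(50),
    report_date   DATE
);

CREATE TABLE report_pressure_test (         -- one such table per report type
    report_id     INT PRIMARY KEY REFERENCES report (report_id),   -- same key as the master
    max_pressure  DECIMAL(10,2),
    duration_min  INT
);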
Google on "gen-spec relational modeling". You'll find a lot of articles discussing exactly this pattern. Some of them focus on table design, while others focus on an object oriented approach.
Table inheritance pops up in a few of them.
I know this won't help much now, but initially it may have been better to have an Entity table rather than 6 different contact types. Then each Entity could have as many addresses as necessary and there would be no need for type in the join.
You'll still have the problem that if you want the sub-type fields and you have only the master contact, you'll have to know what table to go looking at - or else join to all of them. But otherwise this is a workable solution to a common problem.
Another possibility (fairly similar in structure, but different in how you think of it) is to simply put all your contacts into one table. Then for the more specific fields (birthday say for people and department for position#company) create separate tables that are associated with that contact.
Contact Table
--------------
Name
Phone Number
Address Table
-------------
Street / state, etc
ContactId
ContactBirthday Table
--------------
Birthday
ContactId
Departments Table
-----------------
Department
ContactId
It requires a different way of thinking of things though - instead of thinking of people vs. companies, you think of the various functional requirements for the task at hand - if you want to send out birthday cards, get all the contacts that have birthdays associated with them, etc.
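For the birthday-card example (assuming Contact also carries a ContactId key, which the child tables above imply), the query is just:
SELECT c.Name, b.Birthday
FROM Contact c
JOIN ContactBirthday b ON b.ContactId = c.ContactId;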
I'm going to go out on a limb here and suggest you should rethink your normalization strategy (as you seem to be lucky enough to be able to rethink your schema quite fundamentally). If you typically store an address for each contact, then your contact table should have the address fields in it. Alternatively if the address is stored per company then the address should be stored in the company table and your contacts linked to that company.
If your contacts only have one address, or one (or even 3, just not 'many') instance of the other fields, think about rationalizing them into a single table. In my experience having a few null fields is a far better alternative than needing left joins to data you aren't sure exists.
Fortunately for anyone who vehemently disagrees with me you did ask for opinions! :) IMHO you should only normalize when you really need to. Where you are rethinking schemas, denormalization should be considered at every opportunity.
When you have a 7th type, you'll have to create another table.
I'm going to try this approach. Yes, you have to create new tables when you have a new type, but since this table will probably have different columns, you'll end up doing this anyway if you don't use this scheme.
If the tables that inherit from the master don't differ much from one another, I'd recommend you try another approach.
May I suggest just adding a Type table. I.e. a person has an address, name, etc.; then, as each use case presents itself (student, teacher), we have a PersonType table that links an entry in the Person table to n types, with subsequent new tables (teacher, alien, singer) added as the system evolves...
