Database Design - Normalization

This is a new topic to me; I've read a few articles and I'm still unclear, to the point where I'm unsure whether the following question even relates to the title of this post.
My system sends data to a user. The user may elect for the data to be sent by:
XML
Email
Post
Depending on what the user chooses, several additional but different variables are required. An email address, for example, is required for sending the data via email, but not for sending it via XML.
Assuming we had a database table which stored 'Data Delivery Choices' (XML, Email, or Post), would it be best to store the additionally required variables in this table, meaning that if XML was selected the email field would be empty in that row? Or would it be better to create three new tables to store the variables associated with each possible choice in the 'Data Delivery Choices' table, and then associate entries in these tables via the 'Data Delivery Choices' PK?
Or doesn't it matter which way it's done?
For the purposes of the question, forget the fact that we may already hold the user's email address etc. elsewhere.
Thanks!

Some RDBMSs (PostgreSQL, for example) allow for table inheritance, so you can define DataDeliveryChoices and then create three additional tables that inherit from it: DataDeliveryXML, DataDeliveryEmail, ...
MSSQL allows you to store (and query) an XML document in a column, so you can store the additional data as XML (not classic database design, but very flexible should you need to add data fields without changing the schema).
Your way of adding three additional tables is IMO also an acceptable solution.
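As an illustrative sketch of that three-table (supertype/subtype) approach; every table and column name here is made up, not taken from your system:

CREATE TABLE DataDeliveryChoices (
    ChoiceId INTEGER PRIMARY KEY,
    UserId   INTEGER NOT NULL,
    Method   TEXT NOT NULL CHECK (Method IN ('XML', 'EMAIL', 'POST'))
);
-- One detail table per method; each row extends exactly one parent row.
CREATE TABLE EmailDeliveryDetails (
    ChoiceId     INTEGER PRIMARY KEY REFERENCES DataDeliveryChoices (ChoiceId),
    EmailAddress TEXT NOT NULL
);
CREATE TABLE XmlDeliveryDetails (
    ChoiceId    INTEGER PRIMARY KEY REFERENCES DataDeliveryChoices (ChoiceId),
    EndpointUrl TEXT NOT NULL
);
CREATE TABLE PostDeliveryDetails (
    ChoiceId    INTEGER PRIMARY KEY REFERENCES DataDeliveryChoices (ChoiceId),
    AddressLine TEXT NOT NULL,
    Postcode    TEXT NOT NULL
);

Because each detail table shares its primary key with the parent, a choice gets at most one set of method-specific details, and no row carries empty columns for methods it doesn't use.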
The possibilities are endless :)

Related

How to deal with similar fields across SQL Server database tables in a dimension

I am working on a data warehouse solution, and I am trying to build a dimensional model from tables held in a SQL Server database. The tables include Customer, Customer Payments, and Customer Address, among others.
All of these tables have some fields that are repeated across each table, e.g. record update date, record creation date, active flag, closed flag, and a few others. These tables all relate to the Customer in some way, but the tables can be updated independently.
I am in the process of building out a dimension(s) on the back of these tables, but I am struggling to see how best to deal with these repeated fields in an elegant way, as they are all used.
I'd appreciate any guidance from people who have experience with scenarios like this, as I am just starting out.
If more details are needed, I am happy to provide them.
Thanks
Before you even consider how to include them, ask whether those metadata fields even need to be in your dimensional model. If no one will use the Customer Payment Update Date (vs. Created Date or Payment Date), don't bring it into your model. If the customer model includes the current address, you won't need the CustomerAddress.Active flag included as well. You don't need every OLTP field in your model.
Make notes about how you talk about the fields in conversation. How do you identify the current customer address? Check the CurrentAddress flag (CustomerAddress.IsActive). When was the Customer's payment? Check the Customer Payment Date (CustomerPayment.PaymentDate or possibly CustomerPayment.CreatedDate). Try to describe them in common language terms. This will provide the best success in making your model discoverable by your users and intuitive to use.
Naming the columns in the model and source as similar as possible will also help with maintenance and troubleshooting.
Also, make sure you delineate the entities properly. A customer payment would likely be in a separate dimension from the customer. The current address may be in customer, but if there is any value to historical address details, it may make sense to put it into its own dimension, with the Active flag as well.
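As a rough sketch of the renaming idea (these dimension and column names are hypothetical, not from your schema):

-- Keep only the metadata fields users actually ask for, named in business terms.
CREATE TABLE DimCustomer (
    CustomerKey    INTEGER PRIMARY KEY,
    CustomerName   TEXT,
    CurrentAddress TEXT,    -- sourced from CustomerAddress where the Active flag is set
    CustomerSince  DATE     -- sourced from Customer's record creation date
);
-- Separate dimension only if historical addresses carry value.
CREATE TABLE DimCustomerAddress (
    AddressKey       INTEGER PRIMARY KEY,
    CustomerKey      INTEGER REFERENCES DimCustomer (CustomerKey),
    AddressLine      TEXT,
    IsCurrentAddress INTEGER    -- renamed from the OLTP Active flag
);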

SQLite: Individual tables per user or one table for them all?

I've already designed a website which uses an SQLite database. Instead of using one large table, I've designed it so that when a user signs up, an individual table is created for them. Each user will possibly use several hundred records. I did this because I thought it would be easier to structure and access.
I found from other questions on this site that one table is considered better than many tables, one per user.
Would it be worth redesigning my site so that instead of having many tables there would be one large table? My current method seems to work well, though the site is still in development, so I'm not sure how well it would stack up in a real environment.
The question is: would changing the code so that there is one large table instead of many individual ones be worth it in terms of performance, efficiency, organisation and space?
SQLite: Creating a user's table.
CREATE TABLE " + name + " (id INTEGER PRIMARY KEY, subject TEXT, topic TEXT, questionNumber INTEGER, question TEXT, answer TEXT, color TEXT)
SQLite: Adding an account to the accounts table.
"INSERT INTO accounts (name, email, password, activated) VALUES (?,?,?,?)", (name, email, password, activated,)
Please note that I'm using python with Flask if it makes any difference.
EDIT
I am also aware that there are questions like this already; however, none of them state whether the advantages or disadvantages will be worth it.
In an object-oriented language, would you make a class for every user? Or would you have an instance of one class for each user?
Having one table per user is a really bad design.
You can't search based on any field that isn't the username. With your current solution, how would you find all records with a certain questionNumber?
You can't join against the per-user tables. You have to make two queries, one to find the table name and one to actually query the table, which requires two round-trips to the database.
Each user now has their own table schema. On an upgrade, you have to apply your schema migration to every user table, and God help you if some of the tables are inconsistent with the rest.
It's effectively impossible to have foreign keys pointing to the per-user tables. You can't specify the table that the foreign key column points to, because it won't be the same for every user.
You can have name conflicts with your current setup. What if someone registers with the username accounts? Admittedly, this is easy to fix by adding a user_ prefix, but still something to keep in mind.
SQL injection vulnerabilities. What if I register a user named lol; DROP TABLE accounts; --? Query parameters, the primary way of preventing such attacks, don't work on table names.
I could go on.
Please merge all of the tables, and read up on database normalization.
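A minimal sketch of the merged design, assuming your existing accounts table has an integer id primary key:

CREATE TABLE questions (
    id             INTEGER PRIMARY KEY,
    account_id     INTEGER NOT NULL REFERENCES accounts (id),
    subject        TEXT,
    topic          TEXT,
    questionNumber INTEGER,
    question       TEXT,
    answer         TEXT,
    color          TEXT
);
CREATE INDEX idx_questions_account ON questions (account_id);
-- Every lookup is then an ordinary parameterized query, e.g.:
-- SELECT * FROM questions WHERE account_id = ? AND questionNumber = ?

One schema to migrate, ordinary joins and foreign keys, and no table names built from user input.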

Is it a good idea to create a db with a generic table entity that can be decorated with a role and metadata?

I've been thinking about creating a database that, instead of having a table per object I want to represent, would have a series of generic tables that would allow me to represent anything I want, and even modify (that's actually my main interest) the data associated with any kind of object I represent.
As an example, let's say I'm creating a web application that would let people make appointments with hairdressers. What I would usually do is have the following tables in my database:
clients
hairdressers: FK: id of the company the hairdresser works for
companies
appointments: FK: id of the client and the hairdresser for that appointment
But what happens if we deal with scientific hairdressers who want to associate more data with an appointment (e.g. quantity of shampoo used, grams of hair cut, number of scissor strokes, ...)?
I was thinking that instead, I could use the following tables:
entity: represents anything I want. PK(entity_id)
group: is an entity (when I create a group, I first create an entity whose id is then referred to by the FK of the group). PK(group_id), FK(entity_id)
entity_group: each group can contain multiple entities (thus also other groups). PK(entity_id, group_id)
role: e.g. Administrator, Client, HairDresser, Company. PK(role_id)
entity_role: each entity can have multiple roles: PK(entity_id, role_id)
metadata: contains the name and type of the metadata as well as the associated role and a flag that describes whether it is mandatory or not. PK(metadata_id), FK(metadata_type_id, role_id)
metadata_type: contains information about available metadata types. PK(metadata_type_id)
metadata_value: PK(metadata_value_id), FK(metadata_id)
metadata_*: different tables for the different types, e.g. char, text, integer, double, datetime, date, which contain the actual value of a metadata entry associated with an entity. PK(metadata_*_id), FK(metadata_value_id)
entity_metadata: contains data associated with an entity, e.g. name of a client, address of a company, ... PK(entity_id, metadata_value_id). Using the type of the metadata, it's possible to select the actual value of a metadata entry for this entity in the corresponding table.
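In DDL, a rough sketch of part of this structure (names purely illustrative):

CREATE TABLE entity (
    entity_id INTEGER PRIMARY KEY
);
CREATE TABLE role (
    role_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL              -- e.g. 'Client', 'HairDresser'
);
CREATE TABLE entity_role (
    entity_id INTEGER REFERENCES entity (entity_id),
    role_id   INTEGER REFERENCES role (role_id),
    PRIMARY KEY (entity_id, role_id)
);
CREATE TABLE metadata (
    metadata_id      INTEGER PRIMARY KEY,
    metadata_type_id INTEGER,           -- FK to metadata_type
    role_id          INTEGER REFERENCES role (role_id),
    name             TEXT NOT NULL,     -- e.g. 'shampoo_quantity'
    is_mandatory     INTEGER NOT NULL DEFAULT 0
);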
This would allow me to have a completely flexible data structure but has a few drawbacks:
Selecting the metadata associated with an entity returns multiple rows that I have to process in my code to build the representation of the entity.
Selecting the metadata of multiple entities requires looping over the same process as above.
Selecting metadata will also require a select against each of the metadata_* tables that I have.
On the other hand, it has some advantages. For example, instead of having a client table with a lot of fields that will almost never be filled, I just use the exact number of rows that I need.
Is this a good idea at all?
I hope that I've expressed clearly what I'm trying to achieve. I guess I'm not the first one to wonder how to achieve this, but I was not able to find the right keywords to find an answer to this question :/
Thanks!

Dynamic database structure

I would like some database/programming suggestion on a specific issue.
I have 5 different people (who live in different parts of the world) who provide me with data. This data is given to me in a variety of ways, following a standard structure layout. However, it's not always harmonized; the data might have extra things that are not in the standard, so I'd like the structure to be as dynamic as possible to accommodate what each person wants to use.
These 5 data sources are then placed inside a central database I host. So basically I have 5 data sources that are formatted following a standard structure, and they are uploaded to my local database.
I want to automate the upload of this data as much as possible for the person providing the data, so I want them to upload new sets of data that are automatically inserted in my local db.
My questions are:
How should I keep the structure dynamic without having to revisit my standard layout to accommodate new fields of data, or different structure?
How do I make them upload data in a way that is incremental? For example they might be uploading an XML version of their data, my upload code should figure out what already exists.
My final and most important question. Are there better ways of going about this instead of having an upload infrastructure?
How should I keep the structure dynamic without having to revisit my standard layout to accommodate new fields of data, or different structure?
Basically, you pivot the normal database idea of columns and rows.
You have a data name table, which consists of the unique names of the fields of data, and an indicator to tell the import process what type of data is stored, like a date, timestamp, or integer.
You have a data table, which contains the data name id, a sequence number, the data field, and a foreign key to the identifying information.
The sequence number is used to differentiate between different values of the same data name.
The data field holds every type of data possible. This would be a VARCHAR(MAX) in most databases. It's up to the upload process to convert dates and numbers to strings.
Your data table will have a foreign key to the rest of the information that identifies who the data field belongs to.
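A minimal sketch of those two tables (all names here are hypothetical):

CREATE TABLE data_name (
    data_name_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    data_type    TEXT NOT NULL      -- tells the import how to parse the value: 'date', 'timestamp', 'integer', ...
);
CREATE TABLE data (
    data_name_id INTEGER NOT NULL REFERENCES data_name (data_name_id),
    sequence_no  INTEGER NOT NULL,  -- differentiates repeated values of the same data name
    data_value   VARCHAR(8000),     -- everything stored as a string; VARCHAR(MAX) where available
    owner_id     INTEGER NOT NULL,  -- FK to whatever identifies who the value belongs to
    PRIMARY KEY (owner_id, data_name_id, sequence_no)
);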
How do I make them upload data in a way that is incremental? For example they might be uploading an XML version of their data, my upload code should figure out what already exists.
The short answer is that you can't.
Your upload process has to identify duplicate data and not store it in the database.
My final and most important question. Are there better ways of going about this instead of having an upload infrastructure?
This is a hard question to answer without knowing more about the type of data you're receiving, but there is software that allows you to load databases without a lot of programming, by defining the input data structure and mapping that structure to your database tables.
This is a very general question, but I think I have a general answer. What I think solves your problem is to structure your schema so that the properties attached to the master record are not pre-determined. Here is an example involving a phone book application.
Common method using a non-relational table:
Table PERSON has columns Name, HomePhone, OfficePhone.
All well and good, but what do you do if the occasional person shows up with a mobile phone, more than one mobile phone, a fax line, etc.?
Instead what you do is:
Table Person has columns Person_ID, Name.
Table Phones has columns Person_ID, Phone_Type, PhoneNumber.
There is a one-to-many relationship between Person and Phones, and there can be any number of them from zero to a zillion. The tables are JOINed by Person_ID. You have to have business and presentation logic that enumerates the Phone_Type column (or just let it be free-form, which is not as useful but easier).
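In DDL, a sketch of the two tables and the join:

CREATE TABLE Person (
    Person_ID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL
);
CREATE TABLE Phones (
    Person_ID   INTEGER NOT NULL REFERENCES Person (Person_ID),
    Phone_Type  TEXT NOT NULL,      -- 'Home', 'Office', 'Mobile', 'Fax', ...
    PhoneNumber TEXT NOT NULL
);
-- List each person's numbers (an inner join, so people with no phones don't appear):
SELECT p.Name, ph.Phone_Type, ph.PhoneNumber
FROM Person p
JOIN Phones ph ON ph.Person_ID = p.Person_ID;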
You can do that for any property; this is what relational databases are all about. I hope this helps.
As others have said, EAV tables can handle a dynamic structure (but be aware of performance issues on large tables).
But is it in your interest to have your database fields dictated by the client? You can't write business logic to act upon those new fields, because they don't exist yet and could be anything.
Can you force the client to conform to your model? This allows you to know the fields ahead of time and have business logic act upon the fields. It allows you to write meaningful reports as well, rather than just pivoted data dumps.

Any simple approaches for managing customer data change requests for global reference files?

For the first time, I am developing in an environment in which there is a central repository for a number of different industry-standard reference data tables, and many different customers who need to select records from these tables to fill in foreign key information in their customer-specific records.
Because these industry-standard reference files are utilized by all customers, I want to reserve Create/Update/Delete access to these records for global product administrators. However, I would like to implement a (semi-)automated interface by which specific customers could request record additions, deletions, or modifications to any of the industry-standard reference files that are shared among all customers.
I know I need something like a "data change request" table specifying:
user id,
user request datetime,
request type (insert, modify, delete),
a user entered text explanation of the change request,
the user request's current status (pending, declined, completed),
admin resolution datetime,
admin id,
an admin entered text description of the resolution,
etc.
What I can't figure out is how to elegantly handle the fact that these data change requests could apply to dozens of different tables with differing table column definitions.
I would like to give the customer users making these data change requests a convenient way to enter their proposed record additions/modifications directly into CRUD screens that look very much like the reference table CRUD screens they don't have write/delete permissions for (with an additional text explanation and perhaps a request priority field).
I would also like to give the global admins a tool that allows them to view all the outstanding data change requests for the users they oversee, sorted by date requested or by user and date requested. Upon selecting a data change request record off the list, the admin would be directed to another CRUD screen populated with the fields the customer user requested for the new/modified industry-standard reference table record, along with the customer's text explanation, the request status, and the text resolution explanation field. At this point the admin could accept/edit/reject the requested change; if accepted, the affected industry-standard reference file would be automatically updated with the appropriate fields, and the data change request record's status, text resolution explanation, and resolution datetime would all also be appropriately updated.
However, I want to keep the actual production reference tables as simple as possible and free from these extraneous and typically null customer change request fields. I'd also like the data change request file to aggregate all data change requests across all the reference tables, yet somehow "point to" the specific reference table and primary key in question for modification and deletion requests, or the specific reference table and associated customer-entered field values in question for record creation requests.
Does anybody have any ideas of how to design something like this effectively? Is there a cleaner, simpler way I am missing?
Option 1
If preserving the base tables is important then I would create a "change details" table as a child to your change request table. I'm envisioning something like
ChangeID
TableName
TableKeyValue
FieldName
ProposedValue
Add/Change/Delete Indicator
So you'd have a row in this table for every proposed field change. The challenge in this scenario is maintaining the mapping of TableName and FieldName values to the actual tables and fields. If your database structure is fairly static, then this may not be an issue.
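In DDL, that child table might be sketched as follows, assuming the change request header table from the question is called ChangeRequest (all names illustrative):

CREATE TABLE ChangeRequestDetail (
    ChangeID      INTEGER NOT NULL REFERENCES ChangeRequest (ChangeID),
    TableName     TEXT NOT NULL,   -- which reference table the change targets
    TableKeyValue TEXT,            -- PK of the affected row; NULL for an addition
    FieldName     TEXT NOT NULL,
    ProposedValue TEXT,
    ChangeType    TEXT NOT NULL CHECK (ChangeType IN ('A', 'C', 'D'))   -- Add/Change/Delete
);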
Option 2
Add a ChangeID field to each of your base tables. When a change is proposed, add a record to the base table with the ChangeID populated. So, as an example, if you have a Company table, a single company could have multiple records:
CompanyCode  ChangeID  CompanyName  CompanyAddress
-----------  --------  -----------  --------------
COMP1                  My Company   Boston          <-- The "live" record
COMP1        1         New Name     Boston          <-- A proposed change
When the admin commits the change, the existing live record is deleted or archived and the ChangeID value is removed from the proposed record, making it the live record. It may be a little tricky to handle proposed deletions with this option. This option also has the potential to impact the performance of selecting live data for normal usage. However, it does save you the hassle of maintaining a list of table names and field names somewhere in your code.
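Committing a change under this option might look roughly like this, using the example rows above (a sketch; in practice you'd wrap it in a transaction exactly like this):

BEGIN TRANSACTION;
-- Remove (or archive) the current live record...
DELETE FROM Company WHERE CompanyCode = 'COMP1' AND ChangeID IS NULL;
-- ...then promote the proposed record to live by clearing its ChangeID.
UPDATE Company SET ChangeID = NULL WHERE CompanyCode = 'COMP1' AND ChangeID = 1;
COMMIT;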
I'm sure others will have some opinions!
