The problem I am trying to solve is not an overly complicated one but one that I would like to try and solve more elegantly than I currently am.
Problem:
We do business with multiple companies. For the sake of argument, let's say each company produces motor vehicles. Each company has a differing implementation (i.e. different data that must be persisted into a database). When a customer orders a car, you have no way of knowing what type of car they might buy, so it is desirable to have a single lookup table called 'Vehicles' that establishes the relationship between the CustomerId, a VehicleId (internal to our database and globally unique), and some sort of composite key that is unique in one of the many CompanyX_Vehicle tables.
An example would be:
Top level lookup table:
VehicleId
CustomerId
CompanyId
CompanyVehicleId
CompanyAVehicle Table:
CompanyAVehicleId ------> Part of composite key
CompanyId ------> Part of composite key
...... unique implementation and persistence requirements.
CompanyBVehicle Table:
CompanyBVehicleId ------> Part of composite key
CompanyId ------> Part of composite key
...... unique implementation and persistence requirements.
I have to disable foreign key enforcement for obvious reasons; however, in code (in this case C#, EF), I can perform a single query and eagerly include the necessary data from the correct CompanyXVehicle table.
Alternatively, I can omit any kind of relationship and just perform two queries each and every time: one to get the company and company vehicle IDs, and then one against the necessary table.
However I have a feeling there is a better alternative to either of these solutions. Does anyone have a suggestion on how to tackle this particular problem?
I'll put an answer........so this can be closed out (eventually and if no one else puts a better answer).
While there are several ways to do this, I prefer the DRY method.
Which is :
Base Class/Table has the PK and all the scalars that are the same.
A different sub-class(table) for the different "types" of entities. This would have the scalars that are unique to the type.
Animal(Table)
AnimalSurrogateKey (int or guid)
Species (lookup table FK int)
Birthdate (datetime, null)
Dog(Table)
ParentAnimalSurrogateKey (int) PK,FK
BarkDecibels (int)
Snake(Table)
ParentAnimalSurrogateKey (int) PK,FK
ScaleCount (int)
Something like that.
ORMs can handle this. Hand/Manual ORM can handle it....
You can query for general information about the "Animals".
Or you'll have multiple queries to get all the sub-type information.
If you needed to query about basic information about just Dogs, it would be
select AnimalSurrogateKey, Species, Birthdate
from dbo.Animal a
where exists (select null from dbo.Dog d where d.ParentAnimalSurrogateKey = a.AnimalSurrogateKey)
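A minimal sketch of the Animal/Dog/Snake layout above, so the "just Dogs" query can be tried end to end. It uses Python's built-in sqlite3 in place of SQL Server (so the dbo. prefix is dropped); the column types, the Species values and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE Animal (
    AnimalSurrogateKey INTEGER PRIMARY KEY,
    Species            INTEGER NOT NULL,  -- would be an FK to a lookup table
    Birthdate          TEXT               -- nullable
);
-- Each subtype table reuses the parent's key as its own PK and FK.
CREATE TABLE Dog (
    ParentAnimalSurrogateKey INTEGER PRIMARY KEY
        REFERENCES Animal (AnimalSurrogateKey),
    BarkDecibels INTEGER
);
CREATE TABLE Snake (
    ParentAnimalSurrogateKey INTEGER PRIMARY KEY
        REFERENCES Animal (AnimalSurrogateKey),
    ScaleCount INTEGER
);
INSERT INTO Animal VALUES (1, 10, '2015-04-01'), (2, 20, NULL);
INSERT INTO Dog    VALUES (1, 90);
INSERT INTO Snake  VALUES (2, 1200);
""")

# Basic (shared) information about just the dogs.
dogs = conn.execute("""
    SELECT a.AnimalSurrogateKey, a.Species, a.Birthdate
    FROM Animal a
    WHERE EXISTS (SELECT NULL FROM Dog d
                  WHERE d.ParentAnimalSurrogateKey = a.AnimalSurrogateKey)
""").fetchall()
print(dogs)
```

Because the subtype key is both primary key and foreign key, an Animal row can have at most one Dog row, and a Dog row cannot exist without its Animal.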
..........
The key is to follow an established "pattern" for dealing with these scenarios. Most problems have already been thought out........and a future developer will thank you for mentioning in the comments/documentation "we implemented the blah blah blah pattern to implement".
APPEND:
(using info from http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server)
That is a great article going through the scenarios. Again, you'll have to judge if EF is "good enough" or not........if it isn't, then you can manually do your ORM ... and (to get around the multiple queries concern) ... maybe test a query like this .........
select p.PersonID , p.PersonTypeID , s.EnrollmentDate , t.HireDate , par.DifficultyScore
from dbo.People p
left join dbo.Students s on p.PersonID = s.PersonID
left join dbo.Teachers t on p.PersonID = t.PersonID
left join dbo.Parents par on p.PersonID = par.PersonID
And then you can manually do your ORM to "switch/case" off of PersonTypeID, and create your subclasses using the data unique to each subclass (noting that the rows where the type is off, you will have null values.......ex: if your subtype is "Student", then par.DifficultyScore will be null for that row. )
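The "switch/case off of PersonTypeID" step could look roughly like the sketch below. It is written in Python with sqlite3 rather than C#/ado.net, and the PersonTypeID values (1, 2, 3), the class names and the sample rows are all assumptions made up for illustration.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Student:
    person_id: int
    enrollment_date: str

@dataclass
class Teacher:
    person_id: int
    hire_date: str

@dataclass
class Parent:
    person_id: int
    difficulty_score: int

# Hypothetical PersonTypeID values; the real lookup values are not given.
STUDENT, TEACHER, PARENT = 1, 2, 3

def build_person(row):
    """Pick the subclass from PersonTypeID and keep only its own columns."""
    person_id, type_id, enrollment_date, hire_date, difficulty_score = row
    if type_id == STUDENT:
        return Student(person_id, enrollment_date)
    if type_id == TEACHER:
        return Teacher(person_id, hire_date)
    if type_id == PARENT:
        return Parent(person_id, difficulty_score)
    raise ValueError(f"unknown PersonTypeID {type_id}")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People   (PersonID INTEGER PRIMARY KEY, PersonTypeID INTEGER NOT NULL);
CREATE TABLE Students (PersonID INTEGER PRIMARY KEY REFERENCES People (PersonID), EnrollmentDate TEXT);
CREATE TABLE Teachers (PersonID INTEGER PRIMARY KEY REFERENCES People (PersonID), HireDate TEXT);
CREATE TABLE Parents  (PersonID INTEGER PRIMARY KEY REFERENCES People (PersonID), DifficultyScore INTEGER);
INSERT INTO People   VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO Students VALUES (1, '2020-09-01');
INSERT INTO Teachers VALUES (2, '2011-01-15');
INSERT INTO Parents  VALUES (3, 7);
""")

# One round trip: columns of the non-matching subtypes come back as NULL/None.
rows = conn.execute("""
    SELECT p.PersonID, p.PersonTypeID, s.EnrollmentDate, t.HireDate, par.DifficultyScore
    FROM People p
    LEFT JOIN Students s   ON p.PersonID = s.PersonID
    LEFT JOIN Teachers t   ON p.PersonID = t.PersonID
    LEFT JOIN Parents  par ON p.PersonID = par.PersonID
    ORDER BY p.PersonID
""").fetchall()

people = [build_person(row) for row in rows]
print(people)
```

The trade-off is one wide query with mostly-null columns versus one query per subtype; which is cheaper depends on the data, which is why it is worth measuring.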
At some point, you're gonna have to POC (proof of concept) your choice. You have a problem, and there are a handful of ways to deal with it....you have to test them. EF may be good enough..it may not be. That's why I go POCO first...so I can fall back to ado.net/old-school/idatareaders if EF isn't performing well enough.
Good luck.
I have a question regarding data modelling. Suppose I have the following 3 student tables. Source_table1 contains A_ID as primary key and Name as an attribute. Source_table2 has B_ID as primary key and Name and Address as other attributes. Source_table3 has C_ID as primary key and Name, Address and Age as attributes. If we want to create a new Student Master table containing all the records from those tables, how can we do that? If we are creating a cross-reference table, how should we approach that problem?
Integrating data from different sources is complicated. In the end, you want to end up with something like:
student (student_id PK, name, address, source1_id, source2_id, source3_id)
However, there are some issues to resolve to get there.
Identity
How will you identify matching records in the different sources? It looks like your sources use surrogate identifiers, but those have no meaning outside the context of the source databases. What you're looking for is a suitable natural key. The only common denominator among the sources is a student's name, but names are notoriously poor identifiers.
It can be useful to actually test the data rather than assume it will or won't work. For example, a query such as:
SELECT s1.name, COUNT(*) AS amount
FROM student_source_1 s1
INNER JOIN student_source_2 s2 ON s1.name = s2.name
GROUP BY s1.name
HAVING COUNT(*) > 1
repeated for (student_source_2, student_source_3) and (student_source_1, student_source_3) should give you some insight into the size of the problem.
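As a rough illustration of running that check over all three pairs, here is a sketch using Python's sqlite3. The id column names (a_id, b_id, c_id), the helper ambiguous_names and the sample students are assumptions; only the shape of the query comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_source_1 (a_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_source_2 (b_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE student_source_3 (c_id INTEGER PRIMARY KEY, name TEXT, address TEXT, age INTEGER);
INSERT INTO student_source_1 VALUES (1, 'Ann Lee'), (2, 'Bob Roy'), (3, 'Ann Lee');
INSERT INTO student_source_2 VALUES (10, 'Ann Lee', '1 Elm St'), (11, 'Bob Roy', '2 Oak Ave');
INSERT INTO student_source_3 VALUES (100, 'Bob Roy', '2 Oak Ave', 12), (101, 'Bob Roy', '9 Pine Rd', 13);
""")

def ambiguous_names(left, right):
    """Names that join to more than one row pair between two sources."""
    # Table names are interpolated, so only pass trusted, hard-coded names.
    sql = f"""
        SELECT l.name, COUNT(*) AS amount
        FROM {left} l
        INNER JOIN {right} r ON l.name = r.name
        GROUP BY l.name
        HAVING COUNT(*) > 1
        ORDER BY l.name
    """
    return conn.execute(sql).fetchall()

pairs = [
    ("student_source_1", "student_source_2"),
    ("student_source_2", "student_source_3"),
    ("student_source_1", "student_source_3"),
]
report = {pair: ambiguous_names(*pair) for pair in pairs}
for pair, names in report.items():
    print(pair, names)
```

Every name that shows up in the report is one that cannot be matched on name alone for that pair of sources.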
You could match student_source_2 and student_source_3 based on both name and address. That might give better results, or worse if the two sources have different addresses (or spellings thereof) for the same student. That brings us to our second concern:
Inconsistency
Assuming you can resolve the identity problem, you may need to deal with inconsistent data. What if sources 2 and 3 have different addresses for the same student? How do you determine the correct address?
In some cases, it could be sufficient to just map the sources without resolving inconsistencies.
Winging it in the real world
One technique I use on harder cases is to build a mapping table by hand, e.g.
student_map (student_id PK, source1_id, source2_id, source3_id)
Each of the source_id columns should have a unique constraint, and usually all 3 will be nullable. This is a first step toward the student table above.
I would start by inserting all the perfect 1-to-1 matches, then left join each of the sources with the mapping table to get the unmatched records. Having the unmatched source records side-by-side and sorted makes it easy to visually spot likely matches. It's tedious and error-prone work, but sometimes it must be done regardless. For inconsistencies I might choose the most complete/best looking source as base, and fill in the gaps from the other sources. If you can involve teachers or people who are familiar with the actual students, or present them with alternatives to choose from, by all means do so.
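A sketch of the first two steps described above (load the perfect 1-to-1 matches, then list what is left over), using Python's sqlite3 and only two of the three sources to keep it short. The id column names, the sample data and the rule "a name that occurs exactly once on each side counts as a perfect match" are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_source_1 (a_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_source_2 (b_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE student_map (
    student_id INTEGER PRIMARY KEY,
    source1_id INTEGER UNIQUE,   -- nullable, but each source row maps at most once
    source2_id INTEGER UNIQUE,
    source3_id INTEGER UNIQUE
);
INSERT INTO student_source_1 VALUES
    (1, 'Ann Lee'), (2, 'Bob Roy'), (3, 'Ann Lee'), (4, 'Cy Young');
INSERT INTO student_source_2 VALUES
    (10, 'Ann Lee', '1 Elm St'), (11, 'Bob Roy', '2 Oak Ave'), (12, 'Cy Yung', '5 Ash Ln');

-- Step 1: names that occur exactly once on both sides are safe to map.
INSERT INTO student_map (source1_id, source2_id)
SELECT s1.a_id, s2.b_id
FROM student_source_1 s1
JOIN student_source_2 s2 ON s1.name = s2.name
WHERE (SELECT COUNT(*) FROM student_source_1 x WHERE x.name = s1.name) = 1
  AND (SELECT COUNT(*) FROM student_source_2 y WHERE y.name = s2.name) = 1;
""")

mapped = conn.execute(
    "SELECT student_id, source1_id, source2_id, source3_id FROM student_map"
).fetchall()

# Step 2: left join each source against the map to find unmatched records,
# sorted so likely matches end up next to each other for manual review.
unmatched_1 = conn.execute("""
    SELECT s1.a_id, s1.name
    FROM student_source_1 s1
    LEFT JOIN student_map m ON m.source1_id = s1.a_id
    WHERE m.student_id IS NULL
    ORDER BY s1.name, s1.a_id
""").fetchall()
unmatched_2 = conn.execute("""
    SELECT s2.b_id, s2.name
    FROM student_source_2 s2
    LEFT JOIN student_map m ON m.source2_id = s2.b_id
    WHERE m.student_id IS NULL
    ORDER BY s2.name, s2.b_id
""").fetchall()
print(mapped, unmatched_1, unmatched_2)
```

In this made-up data only "Bob Roy" is mapped automatically; the duplicate "Ann Lee" and the misspelled "Cy Yung" are left for the side-by-side review.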
More data can be extremely useful. If the sources have social security numbers, family information, etc, these can be used to match students. I would use any number of queries to find perfect matches among various pieces of information, and insert those into the mapping table, before doing the side-by-side matching.
You may well find that a source has internal consistency problems due to poor design - e.g. multiple records for the same student. This may require fixing the source data before continuing.
A good understanding of the relational model of data is invaluable for this kind of work, since you'll be identifying candidate keys, following dependencies and encountering anomalies.
I have a need to track some history for a table that contains ids from other tables:
I want to track the status of the company_device table such that I can make entries to know when the status of the relationship changed (when a device was assigned to a company, and when it was unassigned, etc). The company_device table would only contain current, existing relationships. So I'd like to do 'something' like this:
But this won't work, because it requires there to be a record in company_device for the FK to be satisfied in the company_device_history table. For example, if I
insert into company_device values (1,1);
insert into company_device_history values (1,1,'Assigned',now());
Then I can't ever remove the record from company_device because of the foreign key constraint. So I've currently settled on this:
so I'm not restricted by the foreign key.
My question is : is there a better way to model this? I could add the status and effective_date to the company_device table and query based on status or effective_date, but that doesn't seem to be a good solution to me. I'd like to know how others might approach this.
When looking exclusively at the problem (that is, when modeling the nature of the business problem at hand), the only thing you need is one single table COMPANY_DEVICE_ASSIGNMENT with four columns C_ID, D_ID, FROM and TO, telling you that "device D_ID was assigned to company C_ID from FROM to TO".
Systems do exist that allow you to work on this basis; however, none of them speak SQL (an excellent book with an in-depth treatment of the subject matter of temporal data, I'd even call it the "canonical" work, is "Time and Relational Theory"; Googling for it can't miss). If you do this in an SQL system, this "straightforward" approach is not going to get you very far. In that case, your practical options are limited by:
what temporal features are offered by the DBMS you want/can/must use
what features are supported by whatever modeling tool you want/can/must use to generate DDL.
As Neil's comment stated: the most recent version of the SQL standard had "temporal support" as its main novelty, and those temporal features are absolutely relevant to your problem, but the products actually offering them are relatively few and far between.
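For a DBMS without temporal features, a hand-rolled approximation of the single COMPANY_DEVICE_ASSIGNMENT table might look like the sketch below (Python with sqlite3). The renamed columns assigned_from/assigned_to (FROM and TO are reserved words), the use of NULL for "still assigned", and the helper functions are all assumptions; nothing here prevents overlapping periods, which is exactly the kind of thing real temporal support would handle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_device_assignment (
    c_id          INTEGER NOT NULL,
    d_id          INTEGER NOT NULL,
    assigned_from TEXT    NOT NULL,
    assigned_to   TEXT,            -- NULL means "still assigned" (assumption)
    PRIMARY KEY (c_id, d_id, assigned_from)
);
""")

def assign(c_id, d_id, when):
    """Open a new assignment period."""
    conn.execute(
        "INSERT INTO company_device_assignment (c_id, d_id, assigned_from) "
        "VALUES (?, ?, ?)",
        (c_id, d_id, when),
    )

def unassign(c_id, d_id, when):
    """Close the open period instead of deleting the row, so history survives."""
    conn.execute(
        "UPDATE company_device_assignment SET assigned_to = ? "
        "WHERE c_id = ? AND d_id = ? AND assigned_to IS NULL",
        (when, c_id, d_id),
    )

assign(1, 1, "2024-01-01")
unassign(1, 1, "2024-03-01")
assign(2, 1, "2024-03-01")

# Current relationships are simply the rows whose period is still open.
current = conn.execute(
    "SELECT c_id, d_id FROM company_device_assignment WHERE assigned_to IS NULL"
).fetchall()
history = conn.execute(
    "SELECT c_id, assigned_from, assigned_to FROM company_device_assignment "
    "WHERE d_id = 1 ORDER BY assigned_from"
).fetchall()
print(current, history)
```

With this shape there is no separate history table and no foreign key from history to the current table, so the deletion problem from the question does not arise.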
What would be the best way to do this and why?
Here is a quick look at a part of my database design, I'm looking for the best way to organize this data.
"Leads" has many "Students", Leads has many "Contacts"
"Students" belongs to "Leads" and belongs to "People"
id, person_id, lead_id
"Contacts" belongs to "Leads" and belongs to "People"
id, person_id, lead_id
I want to be able to signify which contact is going to be a "payer" and if a contact would be the primary contact or not.
I thought originally I would add two more tables like this:
"PrimaryContacts" belongs to "Contacts"
id, contact_id
"Payer" belongs to "Contacts"
id, contact_id
Then I realized it seems kind of overkill to add two more tables for something I can easily represent in the initial Contacts table like this:
"Contacts"
id, person_id, lead_id, type, payer
Then I could have type be 1 or 2, meaning primary or secondary, and then the payer field would be 1 or 2 meaning they either are paying or they aren't.
Is there a benefit of doing it one way or the other or does it matter at all?
Thanks!
I have to admit I'm a little confused by your requirements, but interpreting literally what you say seems to lead to the following database model:
The Contacts.payer flag enables you to have any number of payers, regardless of their primary status.
There really is no need for a separate Payer table in this case.
The Leads.primary_contact_id is a NULL-able FK towards the Contacts, which is what lets you have 0 or 1 primary contact per lead (to avoid the possibility of 0 primary contacts, you'd need a NOT NULL, but this would lead to an insertion cycle, which would have to be resolved through deferred constraints, which are not supported in MySQL).
However, this doesn't guarantee that the primary contact belongs to its own lead (i.e. Contacts.lead_id could be different from Leads.lead_id even when Contacts.contact_id matches Leads.primary_contact_id). Is that a problem? If yes, you'd need a liberal application of identifying relationships and composite PKs, which could be a problem for ORM.
A separate PrimaryContacts table would have a very similar effect to the Leads.primary_contact_id (assuming you got your PK right), and would even have the same problem of allowing 0 primary contacts and lead mismatches. Just having a "backward" FK is simpler and more efficient from the database perspective (though I'm not sure if that's still true from the CakePHP perspective).
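One possible reading of the Leads.primary_contact_id model, including a way to close the lead-mismatch gap, is sketched below with Python's sqlite3 (not MySQL or CakePHP). Instead of full composite PKs it uses a composite foreign key onto a UNIQUE (lead_id, contact_id) pair, which is a substitution of mine; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE Leads (
    lead_id            INTEGER PRIMARY KEY,
    primary_contact_id INTEGER,          -- NULL = no primary contact (yet)
    -- Composite FK: the primary contact must be a contact of this same lead.
    FOREIGN KEY (lead_id, primary_contact_id)
        REFERENCES Contacts (lead_id, contact_id)
);
CREATE TABLE Contacts (
    contact_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL,
    lead_id    INTEGER NOT NULL REFERENCES Leads (lead_id),
    payer      INTEGER NOT NULL DEFAULT 0 CHECK (payer IN (0, 1)),
    UNIQUE (lead_id, contact_id)         -- target of the composite FK
);
INSERT INTO Leads (lead_id) VALUES (1), (2);
INSERT INTO Contacts (contact_id, person_id, lead_id, payer) VALUES
    (10, 100, 1, 1),
    (11, 101, 1, 1),
    (20, 200, 2, 0);
""")

# Contact 10 belongs to lead 1, so it can become lead 1's primary contact.
conn.execute("UPDATE Leads SET primary_contact_id = 10 WHERE lead_id = 1")

# Contact 20 belongs to lead 2, so the composite FK rejects it for lead 1.
try:
    conn.execute("UPDATE Leads SET primary_contact_id = 20 WHERE lead_id = 1")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True

payers = conn.execute(
    "SELECT contact_id FROM Contacts WHERE lead_id = 1 AND payer = 1 ORDER BY contact_id"
).fetchall()
primary = conn.execute(
    "SELECT primary_contact_id FROM Leads WHERE lead_id = 1"
).fetchone()
print(mismatch_rejected, payers, primary)
```

The lead is inserted first with a NULL primary contact and updated afterwards, which is how the insertion cycle is sidestepped without deferred constraints.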
Unfortunately, I'm not familiar with CakePHP - hopefully you'll be able to "translate" this model there on your own.
I understand how to design a database schema that has simple one-to-many relationships between its tables. I would like to know what the convention or best practice is for designating one particular relationship in that set as the primary one. For instance, one Person has many CreditCards. I know how to model that. How would I designate one of those cards as the primary for that person? Solutions I have come up with seem inelegant at best.
I'll try to clarify my actual situation. (Unfortunately, the actual domain would just confuse things.) I have 2 tables each with a lot of columns, let's say Person and Task. I also have Project which has only a couple of properties. One Person has many Projects, but has a primary Project. One Project has many Tasks, but sometimes has one primary Task with alternates, and other times has no primary task and instead a sequence of Tasks. There are no Tasks that are not part of a Project, but it isn't strictly forbidden.
PERSON (PERSON_ID, NAME, ...)
TASK (TASK_ID, NAME, DESC, EST, ...)
PROJECT (NAME, DESC)
I can't seem to figure a way to model the primary Project, primary Task, and the Task sequence all at the same time without introducing either overcomplexity or pure evil.
This is the best I've come up with so far:
PERSON (PERSON_ID, NAME, ...)
TASK (TASK_ID, NAME, DESC, EST, ...)
PROJECT (PROJECT_ID, PERSON_FK, TASK_FK, INDEX, NAME, DESC)
PERSON_PRIMARY_PROJECT (PERSON_FK, PROJECT_FK)
PROJECT_PRIMARY_TASK (PROJECT_FK, TASK_FK)
It just seems like too many tables for a simple concept.
Here's a question I've found that deals with a very similar situation: Database Design: Circular dependency.
Unfortunately, there didn't seem to be a consensus about how to handle the situation, and the "correct" answer was to disable the database consistency checking mechanism. Not cool.
Well, it seems to me that a Person has two relationships with a CreditCard. One is that the person owns it, and the other is that they consider it their primary CreditCard. That tells me you have a one-to-one and a one-to-many relationship. The return relationship for the one-to-one is already in the CreditCard because of the one-to-many relationship it's in.
This means I'd add primary_cc_id as a field in Person and leave CreditCard alone.
Two strategies:
Use a bit column to indicate the preferred card.
Use a PreferredCardTable associating each Person with the ID of its preferred card.
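The first strategy on its own does not stop two cards from being flagged for the same person. One way to enforce "at most one" is a partial unique index, sketched here with Python's sqlite3; the index is my addition rather than part of the answer, and the table, column and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE CreditCard (
    card_id      INTEGER PRIMARY KEY,
    person_id    INTEGER NOT NULL REFERENCES Person (person_id),
    is_preferred INTEGER NOT NULL DEFAULT 0 CHECK (is_preferred IN (0, 1))
);
-- Only rows with is_preferred = 1 are indexed, so each person
-- can own many cards but flag at most one of them.
CREATE UNIQUE INDEX one_preferred_card_per_person
    ON CreditCard (person_id) WHERE is_preferred = 1;

INSERT INTO Person VALUES (1, 'Pat');
INSERT INTO CreditCard VALUES (1, 1, 1), (2, 1, 0);
""")

# Flagging a second card for the same person violates the partial index.
try:
    conn.execute("UPDATE CreditCard SET is_preferred = 1 WHERE card_id = 2")
    second_preferred_allowed = True
except sqlite3.IntegrityError:
    second_preferred_allowed = False

preferred = conn.execute(
    "SELECT card_id FROM CreditCard WHERE person_id = 1 AND is_preferred = 1"
).fetchall()
print(second_preferred_allowed, preferred)
```

Partial (filtered) indexes are not available in every DBMS, so whether this variant is usable depends on the product; the separate preferred-card table works everywhere.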
One person can have many credit cards; then you'd need an identifier on each credit card to actually link that specific credit card to one individual - which I assume you've already made in your model (some kind of ID that links the person to that credit card).
Primary credit card (I assume you mean e.g. as a default credit card?) That would have to be some sort of manual operation (e.g. that you have a third table, that links them together and a column that specifies which one would be the default).
Person (SSN, Name)
CreditCard (CCID, AccountNumber)
P_CC (SSN, CCID, NoID)
So that would mean that if you connect a person to a credit card, you'd have to specify the NoID as, say, '1', then design your query to find by default the credit card that belongs to this individual with NoID '1'.
This is of course just one way of doing it, maybe you'd want to limit by 0, 1 - and then sort them by the date the credit card was added to that person.
Maybe if you'd elaborate and give more information about your columns and ideas it'd make it easier.
So here is what I tried out with Northwind and a C# Windows app, and I had just one query executed.
My Code:
DataClasses1DataContext context = new DataClasses1DataContext();
DataLoadOptions dlo = new DataLoadOptions();
dlo.LoadWith<Product>(b => b.Category);
context.LoadOptions = dlo;
context.DeferredLoadingEnabled = false;
context.Log = Console.Out;
List<Product> test = context.Products.ToList();
MessageBox.Show(test[0].Category.CategoryName);
Result:
SELECT [t0].[ProductID], [t0].[ProductName], [t0].[SupplierID], [t0].[CategoryID], [t0].[QuantityPerUnit], [t0].[UnitPrice], [t0].[UnitsInStock], [t0].[UnitsOnOrder], [t0].[ReorderLevel], [t0].[Discontinued], [t2].[test], [t2].[CategoryID] AS [CategoryID2], [t2].[CategoryName], [t2].[Description], [t2].[Picture]
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN (
SELECT 1 AS [test], [t1].[CategoryID], [t1].[CategoryName], [t1].[Description], [t1].[Picture]
FROM [dbo].[Categories] AS [t1]
) AS [t2] ON [t2].[CategoryID] = [t0].[CategoryID]
For a database assignment I have to model a system for a school. Part of the requirements is to model information for staff, students and parents.
In the UML class diagram I have modelled this as those three classes being subtypes of a person type. This is because they will all require information on, among other things, address data.
My question is: how do I model this in the database (mysql)?
Thoughts so far are as follows:
Create a monolithic person table that contains all the information for each type and will have lots of null values depending on what type is being stored. (I doubt this would go down well with the lecturer unless I argued the case very convincingly).
A person table with three foreign keys which reference the subtypes but two of which will be null - in fact I'm not even sure if that makes sense or is possible?
According to this wikipage about django it's possible to implement the primary key on the subtypes as follows:
"id" integer NOT NULL PRIMARY KEY REFERENCES "supertype" ("id")
Something else I've not thought of...
So for those who have modelled inheritance in a database before; how did you do it? What method do you recommend and why?
Links to articles/blog posts or previous questions are more than welcome.
Thanks for your time!
UPDATE
Alright thanks for the answers everyone. I already had a separate address table so that's not an issue.
Cheers,
Adam
4 tables staff, students, parents and person for the generic stuff.
Staff, students and parents have foreign keys that each refer back to Person (not the other way around).
Person has field that identifies what the subclass of this person is (i.e. staff, student or parent).
EDIT:
As pointed out by HLGM, addresses should exist in a separate table, as any person may have multiple addresses. (However - I'm about to disagree with myself - you may wish to deliberately constrain addresses to one per person, limiting the choices for mailing lists etc).
Well I think all approaches are valid and any lecturer who marks down for shoving it in one table (unless the requirements are specific to say you shouldn't) is removing a viable strategy due to their own personal opinion.
I highly recommend that you check out the documentation on NHibernate as this provides different approaches for performing the above. Which I will now attempt to poorly parrot.
Your options:
1) One table with all the data that has a "discriminator" column. This column states what kind of person the person is. This is viable in simple scenarios and in (seriously) high-performance ones where the joins would hurt too much.
2) Table per class, which will lead to duplication of columns but will avoid joins again, so it's simple and a lil faster (although only a lil, and indexing mitigates this in most scenarios).
3) "Proper" inheritance. The normalised version. You are almost there but your key is in the wrong place IMO. Your Employee table should contain a PersonId so you can then do:
select employee.id, person.name from employee inner join person on employee.personId = person.personId
To get all the names of employees where name is only specified on the person table.
I would go for #3.
Your goal is to impress a lecturer, not a PM or customer. Academics tend to dislike nulls and might (subconsciously) penalise you for using the other methods (which rely on nulls.)
And you don't necessarily need that django extension (PRIMARY KEY ... REFERENCES ...) You could use an ordinary FOREIGN KEY for that.
"So for those who have modelled inheritance in a database before; how did you do it? What method do you recommend and why?"
Methods 1 and 3 are good. The differences are mostly in what your use cases are.
1) adaptability -- which is easier to change? Several separate tables with FK relations to the parent table.
2) performance -- which requires fewer joins? One single table.
Rats. No design accomplishes both.
Also, there's a third design in addition to your mono-table and FK-to-parent.
Three separate tables with some common columns (usually copy-and-paste of the superclass columns among all subclass tables). This is very flexible and easy to work with. But, it requires a union of the three tables to assemble an overall list.
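A small sketch of what "a union of the three tables" means in practice, using Python's sqlite3. The column choices and sample people are assumptions; the literal role column is added so each row still says which table it came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The common columns (name, address) are copied into every table.
CREATE TABLE staff    (id INTEGER PRIMARY KEY, name TEXT NOT NULL, address TEXT, salary INTEGER);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, address TEXT, gpa REAL);
CREATE TABLE parents  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, address TEXT,
                       emergency_contact_number TEXT);
INSERT INTO staff    VALUES (1, 'Carol', '3 Hill Rd', 40000);
INSERT INTO students VALUES (1, 'Alice', '1 Elm St', 3.5);
INSERT INTO parents  VALUES (1, 'Bob',   '1 Elm St', '555-0100');
""")

# Assembling one overall list needs a UNION over the shared columns.
everyone = conn.execute("""
    SELECT 'staff' AS role, id, name, address FROM staff
    UNION ALL
    SELECT 'student', id, name, address FROM students
    UNION ALL
    SELECT 'parent', id, name, address FROM parents
    ORDER BY name
""").fetchall()
for row in everyone:
    print(row)
```

Note that the ids are only unique per table here (all three rows have id 1), which is one reason the role column is needed to identify a person in the combined list.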
OO databases go through the same stuff and come up with pretty much the same options.
If the point is to model subclasses in a database, you probably are already thinking along the lines of the solutions I've seen in real OO databases (leaving fields empty).
If not, you might think about creating a system that doesn't use inheritance in this way.
Inheritance should always be used quite sparingly, and this is probably a pretty bad case for it.
A good guideline is to never use inheritance unless you actually have code that does different things to the field of a "Parent" class than to the same field in a "Child" class. If business code in your class doesn't specifically refer to a field, that field absolutely shouldn't cause inheritance.
But again, if you are in school, that may not match what they are trying to teach...
The "correct" answer for the purposes of an assignment is probably #3 :
Person
PersonId Name Address1 Address2 City Country
Student
PersonId StudentId GPA Year ..
Staff
PersonId StaffId Salary ..
Parent
PersonId ParentId ParentType EmergencyContactNumber ..
Where PersonId is always the primary key, and also a foreign key in the last three tables.
I like this approach because it makes it easy to represent the same person having more than one role. A teacher could very well also be a parent, for example.
I suggest five tables
Person
Student
Staff
Parent
Address
Why - because people can have multiple addresses and people can also have multiple roles, and the information you want for staff is different from the information you need to store for parent or student.
Further, you may want to store name as last_name, Middle_name, first_name, Name_suffix (like jr.) instead of as just name. Believe me, you will want to be able to search on last_name! Name is not unique, so you will need to make sure you have a unique surrogate primary key.
Please read up about normalization before trying to design a database. Here is a source to start with:
http://www.deeptraining.com/litwin/dbdesign/FundamentalsOfRelationalDatabaseDesign.aspx
Super type Person should be created like this:
CREATE TABLE Person(PersonID int primary key, Name varchar ... etc ...)
All Sub types should be created like this:
CREATE TABLE IF NOT EXISTS Staffs(StaffId INT NOT NULL ,
PRIMARY KEY (StaffId) ,
CONSTRAINT FK_StaffId FOREIGN KEY (StaffId) REFERENCES Person(PersonId)
);
CREATE TABLE IF NOT EXISTS Students(StudentId INT NOT NULL ,
PRIMARY KEY (StudentId) ,
CONSTRAINT FK_StudentId FOREIGN KEY (StudentId) REFERENCES Person(PersonId)
);
CREATE TABLE IF NOT EXISTS Parents(PersonID INT NOT NULL ,
PRIMARY KEY (PersonID ) ,
CONSTRAINT FK_PersonID FOREIGN KEY (PersonID ) REFERENCES Person(PersonId)
);
The foreign key in the subtype tables (Staffs, Students, Parents) adds two conditions:
1) A Person row cannot be deleted until the corresponding subtype row has been deleted. E.g. if there is a student entry in the Students table referring to the Person table, the person entry cannot be deleted without first deleting the student entry, which is very important. In object terms: once a Student object is created, we cannot delete the base Person object without deleting the Student object first.
2) All subtypes have a "not null" foreign key to make sure each subtype row always has an existing base row. E.g. if you create a Student object, you must create the Person object first.
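Both conditions can be checked with a short experiment. The sketch below uses Python's sqlite3 rather than MySQL, keeps only the Person and Students tables, and adds a hypothetical attempt helper that reports whether a statement was accepted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE Person (
    PersonId INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL
);
CREATE TABLE Students (
    StudentId INTEGER NOT NULL PRIMARY KEY,
    CONSTRAINT FK_StudentId FOREIGN KEY (StudentId) REFERENCES Person (PersonId)
);
""")

def attempt(sql, params=()):
    """Run one statement; return True if accepted, False on a constraint error."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

# A student cannot exist before its person does.
orphan_student_ok = attempt("INSERT INTO Students (StudentId) VALUES (?)", (1,))

# Person first, then the subtype row with the same key.
person_ok = attempt("INSERT INTO Person (PersonId, Name) VALUES (?, ?)", (1, "Alice"))
student_ok = attempt("INSERT INTO Students (StudentId) VALUES (?)", (1,))

# The person cannot be deleted while the student row still points at it.
delete_person_first_ok = attempt("DELETE FROM Person WHERE PersonId = ?", (1,))

# Deleting the subtype row first makes the person deletable again.
delete_student_ok = attempt("DELETE FROM Students WHERE StudentId = ?", (1,))
delete_person_after_ok = attempt("DELETE FROM Person WHERE PersonId = ?", (1,))

print(orphan_student_ok, person_ok, student_ok,
      delete_person_first_ok, delete_student_ok, delete_person_after_ok)
```

The same ordering rules apply in MySQL as long as the tables use an engine that enforces foreign keys (InnoDB).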