Need Help Creating Primary Key/Foreign Key Relationships between Multiple Tables - sql-server

Background:
(I'm using Microsoft SQL Server 2014)
My company receives data files (tblFile) that contain many accounts (tblAccount). For each data file, we may perform multiple "pricings" (tblPricing), and these "pricings" may contain all of the accounts in the file, or only a subset of them, but the "pricings" cannot contain any accounts not in the file on which the pricing is based. So, in summary:
We get a single file
This file can contain many accounts
We create many pricings from this single file
Each pricing can contain all or some of the accounts from the file it is linked to, but no accounts not in that file
Here is a (way) simplified database diagram as it exists today:
Problem:
What's working so far:
The 1:Many relationship between tblFile and tblPricing
The Many:Many relationship between tblFile and tblAccount (an account can exist in multiple files)
The Many:Many relationship between tblPricing and tblAccount (because many pricings can be performed, an account can exist in many pricings)
Our problem comes from trying to enforce integrity between the subset of accounts that a file has and the subset of accounts that a pricing has. With the above structure, tblPricingAccounts can contain accounts not contained in tblFileAccounts, violating our need for each pricing to contain only the accounts from the file on which it is based.
I've tried changing the foreign key relationships where I broke the link between tblPricingAccounts and tblAccount, removed 'acct_id' from tblPricingAccounts, and instead linked tblPricingAccounts to tblFileAccounts (yes, I know I need a primary key in tblFileAccounts and I had one). But, then I was able to insert whatever 'pricing_id' I wanted into tblPricingAccounts. Now I could link accounts to a pricing that had nothing to do with the file that originally contained those accounts.
Need
At the end of the day, I don't care what the structure or relationships of my database look like. I simply need the following criteria met, and I can't seem to wrap my mind around it:
A file contains many accounts.
A file contains many pricings.
A pricing contains many accounts, but those accounts MUST be contained in the file that the pricing is linked to.
Any help is appreciated, and I'm open to all suggestions that can be performed within SQL Server. Ultimately I'm building a web application around this database, and I'm using Entity Framework 6 to make life easier (mostly...). I could obviously enforce the above 3 needs through my code, but I really would like the database to be the last line of defense in enforcing this integrity.

This is a situation that foreign key constraints are not intended to handle. FK constraints test for existence of values between tables; they do not enforce particular cardinality requirements.
Simple cardinality is the "one to many", "many to many" relationships mentioned in the question. Your more complex need is still essentially about cardinality though: it's a requirement that certain subsets of rows relate to certain other subsets of rows in a particular way. "Windowed cardinality" if you will. (My own coinage as far as I know.)
As suggested in comments to the question, one way to enforce this wholly within the database is via triggers. A well-crafted trigger in this case would probably test whether the new rows to be inserted are valid, and raise an error without inserting them if not. For a bulk insert, you may wish to insert the valid rows and reject the rest, or reject the whole batch if one or more rows are invalid. You can also craft logic to handle updates or deletions that could break your integrity requirements.
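A minimal sketch of the trigger idea, written in Python against an in-memory SQLite database purely so it can be run as-is. The table and column names follow the question; the trigger name, the sample IDs and the `try_insert` helper are made up for illustration. SQL Server trigger syntax is different (its triggers fire per statement and read the `inserted` pseudo-table), so treat this as the shape of the check rather than something to paste into SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE tblFile    (file_id INTEGER PRIMARY KEY);
CREATE TABLE tblAccount (acct_id INTEGER PRIMARY KEY);
CREATE TABLE tblPricing (
    pricing_id INTEGER PRIMARY KEY,
    file_id    INTEGER NOT NULL REFERENCES tblFile(file_id)
);
CREATE TABLE tblFileAccounts (
    file_id INTEGER NOT NULL REFERENCES tblFile(file_id),
    acct_id INTEGER NOT NULL REFERENCES tblAccount(acct_id),
    PRIMARY KEY (file_id, acct_id)
);
CREATE TABLE tblPricingAccounts (
    pricing_id INTEGER NOT NULL REFERENCES tblPricing(pricing_id),
    acct_id    INTEGER NOT NULL REFERENCES tblAccount(acct_id),
    PRIMARY KEY (pricing_id, acct_id)
);

-- Reject a pricing/account pair unless the account is in the pricing's file.
CREATE TRIGGER trg_pricing_account_in_file
BEFORE INSERT ON tblPricingAccounts
WHEN NOT EXISTS (
    SELECT 1
    FROM tblPricing p
    JOIN tblFileAccounts fa ON fa.file_id = p.file_id
    WHERE p.pricing_id = NEW.pricing_id AND fa.acct_id = NEW.acct_id
)
BEGIN
    SELECT RAISE(ABORT, 'account is not in the file this pricing is based on');
END;

-- Hypothetical sample data: account 10 is in file 1, account 20 in file 2.
INSERT INTO tblFile VALUES (1), (2);
INSERT INTO tblAccount VALUES (10), (20);
INSERT INTO tblFileAccounts VALUES (1, 10), (2, 20);
INSERT INTO tblPricing VALUES (100, 1);
""")


def try_insert(pricing_id, acct_id):
    """Return True if the row was accepted, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tblPricingAccounts VALUES (?, ?)",
                (pricing_id, acct_id),
            )
        return True
    except sqlite3.DatabaseError:
        return False


accepted = try_insert(100, 10)  # account 10 is in file 1, which pricing 100 uses
rejected = try_insert(100, 20)  # account 20 is only in file 2
```

This only covers inserts. Updates to tblPricingAccounts, deletes from tblFileAccounts, and changes to a pricing's file_id would each need their own check.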
Be aware that triggers add work to every write they fire on, so they can hurt performance, especially if the table is being changed frequently.
Other approaches are to handle this in application logic, as suggested, and/or allow data into the tables regardless, but validate existing data periodically. For example, a nightly process could identify data failing this requirement and pass it to a human to correct.

It sounds like tblFileAccounts might be superfluous. Try removing it altogether and inferring which accounts exist in which files through the relationships captured in tblPricingAccounts and tblPricing.
If this meets your need, and there are no attributes (columns) which rightfully belong to the tblFileAccounts object (table), then I think your problem is solved.

Related

Auto-Complete/Primary Key as String - PostgreSQL

I set up a database that is not too complex but nonetheless has multiple many-to-many relationships. Let me first explain the database briefly using three tables (there are many more, but just to keep things simple):
The database stores information about completed projects. One attribute is the software used. So I have three tables (with their respective columns/keys):
tblProjects(ProjectID[PK], ProjectTitle, etc...)
tblProjectsSoftware(SoftwareID[FK], ProjectID[FK], UniqueID[PK])
tblSoftwareUsed(SoftwareID[PK], SoftwareName)
In order to make data entry easier in phppgadmin, I was considering just making 'SoftwareName' the primary key in tblSoftwareUsed. This is because when I go to enter the software associated with certain projects into tblProjectsSoftware, I can only use the auto-complete feature on the SoftwareID column which is just more or less a meaningless number.
As you can see, when entering data into the SoftwareID column of tblSoftwareUsed, I would only be able to 'filter' results by the ID and not the name. When this database gets large, it may not be an issue for software, but there are some other attributes that will have tons of records. To explain that further, I would start my data entry by creating a record for the project in tblProjects. Then I would create new records (if necessary) for software used. Then, when entering data into tblProjectsSoftware, I would either have to know the ID of the software or click through a few pages to find it.
So, my question is, would I have any issues by making the name of the software my Primary Key, or would it be better to just leave it as is with the ID as the PK? Furthermore, maybe I am missing an option to make 'SoftwareName' searchable in addition to the ID.
There are advantages and disadvantages to using surrogate keys, which are discussed at length in this Wikipedia article:
http://en.wikipedia.org/wiki/Surrogate_key
Borrowing their headers...
Advantages:
Immutability
Requirement changes
Performance
Compatibility
Uniformity
Validation
Disadvantages:
Disassociation
Query optimization
Normalization
Business process modeling
Inadvertent disclosure
Inadvertent assumptions
More often than not, you'll want to use a surrogate key for practical reasons -- such as avoiding headaches when you need to update a software name.
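To make the update point concrete, here is a small sketch in Python with an in-memory SQLite database (chosen only so it runs anywhere; the original is PostgreSQL). Table and column names come from the question. The UNIQUE constraint on SoftwareName, the sample rows and the lookup-by-name insert are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE tblSoftwareUsed (
    SoftwareID   INTEGER PRIMARY KEY,   -- surrogate key
    SoftwareName TEXT NOT NULL UNIQUE   -- name stays unique and searchable
);
CREATE TABLE tblProjects (
    ProjectID    INTEGER PRIMARY KEY,
    ProjectTitle TEXT NOT NULL
);
CREATE TABLE tblProjectsSoftware (
    UniqueID   INTEGER PRIMARY KEY,
    ProjectID  INTEGER NOT NULL REFERENCES tblProjects(ProjectID),
    SoftwareID INTEGER NOT NULL REFERENCES tblSoftwareUsed(SoftwareID)
);
-- Hypothetical sample rows.
INSERT INTO tblSoftwareUsed (SoftwareName) VALUES ('AutoCad');
INSERT INTO tblProjects (ProjectTitle) VALUES ('Bridge survey');
""")

# Link by name: let the database look the IDs up instead of typing them.
conn.execute(
    """
    INSERT INTO tblProjectsSoftware (ProjectID, SoftwareID)
    SELECT p.ProjectID, s.SoftwareID
    FROM tblProjects p, tblSoftwareUsed s
    WHERE p.ProjectTitle = ? AND s.SoftwareName = ?
    """,
    ("Bridge survey", "AutoCad"),
)

# Fixing the spelling touches one row; the link rows are not rewritten.
conn.execute(
    "UPDATE tblSoftwareUsed SET SoftwareName = 'AutoCAD' WHERE SoftwareName = 'AutoCad'"
)

names = [
    row[0]
    for row in conn.execute(
        """
        SELECT s.SoftwareName
        FROM tblProjectsSoftware ps
        JOIN tblSoftwareUsed s ON s.SoftwareID = ps.SoftwareID
        """
    )
]
```

With the name as the primary key, the same rename would have to cascade into every row of tblProjectsSoftware that references it.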

Database Is-a relationship

My problem relates to DB schema developing and is as follows.
I am developing a purchasing module, which I want to use for purchasing both items and services.
Following is my EER diagram, (note that service has very few specialized attributes – max 2)
My question is whether to keep products and services in two tables or in just one table.
One table option –
Reduces complexity, as I will only need to specify an item id referring to the item table, which will have an “item_type” field to identify whether it’s a product or a service
Two table option –
Will I have to refer to a separate product or service everywhere I want to reference them, and keep an “item_type” field in every table that refers to either a product or a service?
Currently planning to use option 1, but want to know expert opinion on this matter. Highly appreciate your time and advice. Thanks.
I'd certainly go with the "two tables" option. You see, you have to distinguish Products and Services, so you may either use switch(item_type) { ... } in your program or entirely distinct code paths for Product and for Service. And if a need for updating the DB schema arises, switch is harder to maintain.
The second reason is NULLs. I'd advise avoiding them as much as you can; they create more problems than they solve. With two tables you can declare all fields non-NULL and forget about NULL-processing. With the one-table option, you have to manually ensure that if item_type=product, then Product-specific fields are not NULL and Service-specific ones are, and that if item_type=service, then Service-specific fields are not NULL and Product-specific ones are. That's not quite pleasant work, and the DBMS won't do it for you with a simple column constraint (there is no NOT NULL IF another_field = value in SQL or anything like it); the nearest declarative option is a table-level CHECK constraint that spells out every case by hand, on a DBMS that enforces CHECK.
Go with two tables. It's easier to support. I once saw a DB where everything, every single piece of data went in just two tables — there were pages and pages of code to make sure that necessary fields are not NULL.
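As a rough illustration of the "declare everything non-NULL" point, here is a sketch in Python with an in-memory SQLite database. All table and column names (product, service, weight_kg, hourly_rate) are hypothetical, since the question's diagram is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    unit_price REAL NOT NULL,
    weight_kg  REAL NOT NULL      -- product-only attribute (hypothetical)
);
CREATE TABLE service (
    service_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    hourly_rate REAL NOT NULL     -- service-only attribute (hypothetical)
);
""")

# A service row never has to carry product columns at all.
conn.execute("INSERT INTO service (name, hourly_rate) VALUES ('Installation', 40.0)")

# A product row with a missing product attribute is refused by NOT NULL alone.
try:
    conn.execute("INSERT INTO product (name, unit_price) VALUES ('Cable', 3.5)")
    missing_weight_rejected = False
except sqlite3.IntegrityError:
    missing_weight_rejected = True
```

No item_type column and no per-type NULL rules are needed here. The cost, as the question notes, is that anything referring to "a product or a service" has to deal with two tables.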
If I were implementing this, I would go for the two-table option. It's kind of like the first rule of schema normalization: remove multi-valued attributes. Using item_type is not recommended. Once you create separate tables, you don't need item_type; you can just use the foreign key relationship.
Consider reading this article:
http://en.wikipedia.org/wiki/Database_normalization
It should help.

Database Child Table with Two Possible Parents

First off, I'm not sure how exactly to search for this, so if it's a duplicate please excuse me. And I'm not even sure if it'd be better suited to one of the other StackExchange sites; if so, please let me know and I'll ask over there instead. Anyways...
Quick Overview of the Project
I'm working on a hobby project -- a writer's notebook of sorts -- to practice programming and database design. The basic structure is fairly simple: the user can create notebooks, and under each notebook they can create projects associated with that notebook. Maybe the notebook is for a series of short stories, and each project is for an individual story.
They can then add items (scenes, characters, etc.) to either a specific project within the notebook, or to the notebook itself so that it's not associated with a particular project. This way, they can have scenes or locations that span multiple projects, as well as having some that are specific to a particular project.
The Problem
I'm trying to keep a good amount of the logic within the database -- especially within the table structure and constraints if at all possible. The basic structure I have for a lot of the items is basically like this (I'm using MySql, but this is a pretty generic problem -- just mentioning it for the syntax):
CREATE TABLE SCENES(
ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY NOT NULL,
NOTEBOOK BIGINT UNSIGNED NULL,
PROJECT BIGINT UNSIGNED NULL,
....
);
The problem is that I need to ensure that at least one of the two references, NOTEBOOK and/or PROJECT, is set. They don't have to both be set -- PROJECT has a reference to the NOTEBOOK it's in. I know I could just have a generic "Parent Id" field, but I don't believe it'd be possible to have a foreign key to two tables, right? There's also the possibility of adding additional cross-reference tables -- i.e. SCENES_X_NOTEBOOKS and SCENES_X_PROJECTS -- but that'd get out of hand pretty quickly, since I'd have to add similar tables for each of the different item types I'm working with. That would also introduce the problem of ensuring each item has an entry in the cross-reference tables.
It'd be easy to put this kind of logic in a stored procedure or the application logic, but I'd really like to keep it in a constraint of some kind if at all possible, just to eliminate any possibility that the logic got bypassed somehow.
Any thoughts? I'm interested in pretty much anything -- even if it involves a redesign of the tables or something.
The thing about scenes and characters is that a writer might drop them from their current project. When that happens, you don't want to lose the scenes and characters, because the writer might decide to use them years later.
I think the simplest approach is to redefine this:
They can then add items (scenes, characters, etc.) to either a
specific project within the notebook, or to the notebook itself so
that it's not associated with a particular project.
Instead of that, think about saying this.
They can then add items (scenes, characters, etc.) to either a
user-defined project within the notebook, or to the system-defined
project named "Unassigned". The project "Unassigned" is for things not
currently assigned to a user-defined project.
If you do that, then scenes and characters will always be assigned to a project--either to a user-defined project, or to the system-defined project named "Unassigned".
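A sketch of how the "Unassigned" idea could look, in Python with an in-memory SQLite database (the question uses MySQL; SQLite is used here only so the example runs as-is). The IS_SYSTEM flag, the TITLE columns and the `create_notebook` helper are assumptions for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE NOTEBOOKS (ID INTEGER PRIMARY KEY, TITLE TEXT NOT NULL);
CREATE TABLE PROJECTS (
    ID        INTEGER PRIMARY KEY,
    NOTEBOOK  INTEGER NOT NULL REFERENCES NOTEBOOKS(ID),
    TITLE     TEXT NOT NULL,
    IS_SYSTEM INTEGER NOT NULL DEFAULT 0  -- 1 marks the built-in "Unassigned" project
);
CREATE TABLE SCENES (
    ID      INTEGER PRIMARY KEY,
    PROJECT INTEGER NOT NULL REFERENCES PROJECTS(ID),  -- single, mandatory parent
    TITLE   TEXT NOT NULL
);
""")


def create_notebook(title):
    """Create a notebook together with its system-defined 'Unassigned' project."""
    with conn:  # both inserts commit together or not at all
        notebook_id = conn.execute(
            "INSERT INTO NOTEBOOKS (TITLE) VALUES (?)", (title,)
        ).lastrowid
        unassigned_id = conn.execute(
            "INSERT INTO PROJECTS (NOTEBOOK, TITLE, IS_SYSTEM) VALUES (?, 'Unassigned', 1)",
            (notebook_id,),
        ).lastrowid
    return notebook_id, unassigned_id


notebook_id, unassigned_id = create_notebook("Short stories")

# A notebook-level scene simply goes into the "Unassigned" project.
with conn:
    conn.execute(
        "INSERT INTO SCENES (PROJECT, TITLE) VALUES (?, ?)",
        (unassigned_id, "Harbor at dawn"),
    )
```

Every scene now has exactly one NOT NULL foreign key, and the notebook is always reachable through the project, so the "at least one parent" rule is enforced by the table structure itself.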
I'm unclear as to what exactly your requirements are, but let me at least try to answer some of your individual questions...
The problem is that I need to ensure that at least one of the two references, NOTEBOOK and/or PROJECT, is set.
CHECK (NOTEBOOK IS NOT NULL OR PROJECT IS NOT NULL)
I don't believe it'd be possible to have a foreign key to two tables, right?
Theoretically, you can reference two tables from the same field, but this would mean both of these tables would have to contain the matching row. This is probably not what you want.
You are on the right track here - let the NOTEBOOK be the child endpoint of a FK towards one table and the PROJECT towards the other. A NULL foreign key will not be enforced, so you don't have to set both of them.
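Putting the CHECK and the two nullable foreign keys together, a runnable sketch in Python with an in-memory SQLite database might look like this. The sample IDs and the `accepts` helper are illustrative assumptions. One caveat for the MySQL setup in the question: MySQL only enforces CHECK constraints from version 8.0.16 onward; earlier versions parse and silently ignore them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE NOTEBOOKS (ID INTEGER PRIMARY KEY);
CREATE TABLE PROJECTS (
    ID       INTEGER PRIMARY KEY,
    NOTEBOOK INTEGER NOT NULL REFERENCES NOTEBOOKS(ID)
);
CREATE TABLE SCENES (
    ID       INTEGER PRIMARY KEY,
    NOTEBOOK INTEGER REFERENCES NOTEBOOKS(ID),  -- nullable FK
    PROJECT  INTEGER REFERENCES PROJECTS(ID),   -- nullable FK
    CHECK (NOTEBOOK IS NOT NULL OR PROJECT IS NOT NULL)
);
-- Hypothetical sample data.
INSERT INTO NOTEBOOKS (ID) VALUES (1);
INSERT INTO PROJECTS (ID, NOTEBOOK) VALUES (7, 1);
""")


def accepts(notebook, project):
    """Return True if a scene with these parents is accepted by the database."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO SCENES (NOTEBOOK, PROJECT) VALUES (?, ?)",
                (notebook, project),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "notebook only": accepts(1, None),
    "project only": accepts(None, 7),
    "neither": accepts(None, None),        # violates the CHECK
    "unknown project": accepts(None, 99),  # violates the foreign key
}
```

The NULL side of each foreign key is simply not checked, which is what lets a scene hang off either parent.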
There's also the possibility of adding additional cross-reference tables -- i.e. SCENES_X_NOTEBOOKS and SCENES_X_PROJECTS -- but that'd get out of hand pretty quickly, since I'd have to add similar tables for each of the different item types I'm working with.
If you are talking about junction (aka. link) tables that model many-to-many relationships, then yes - you'd have to add them for each pair of tables engaged in such a relationship.
You could, however, minimize the number of such table pairs by using inheritance (aka. category, subclassing, subtype, generalization hierarchy...). Imagine you have a set of M tables that have to be connected to a second set of N tables. Normally, you'd have to create M*N junction tables. But if you inherit all tables in the first set from a common parent table, and do the same for the second set, you can now connect them all through just one junction table between these two parent tables.
The full discussion on inheritance is beyond the scope here, but you might want to look at "ERwin Methods Guide", chapter "Subtype Relationships", as well as this post.
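One possible reading of the inheritance idea, sketched in Python with an in-memory SQLite database. The parent tables ITEMS and CONTAINERS, the ITEM_CONTAINER junction and all sample rows are hypothetical names chosen for this sketch; they are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Common parents (hypothetical names) for the two sets of tables.
CREATE TABLE CONTAINERS (ID INTEGER PRIMARY KEY);
CREATE TABLE ITEMS      (ID INTEGER PRIMARY KEY);

-- Each subtype row shares its key with a row in its parent table.
CREATE TABLE NOTEBOOKS  (ID INTEGER PRIMARY KEY REFERENCES CONTAINERS(ID), TITLE TEXT NOT NULL);
CREATE TABLE PROJECTS   (ID INTEGER PRIMARY KEY REFERENCES CONTAINERS(ID), TITLE TEXT NOT NULL);
CREATE TABLE SCENES     (ID INTEGER PRIMARY KEY REFERENCES ITEMS(ID), TITLE TEXT NOT NULL);
CREATE TABLE CHARACTERS (ID INTEGER PRIMARY KEY REFERENCES ITEMS(ID), NAME TEXT NOT NULL);

-- One junction table instead of 2 x 2 = 4.
CREATE TABLE ITEM_CONTAINER (
    ITEM      INTEGER NOT NULL REFERENCES ITEMS(ID),
    CONTAINER INTEGER NOT NULL REFERENCES CONTAINERS(ID),
    PRIMARY KEY (ITEM, CONTAINER)
);

-- Hypothetical sample data: a scene in a notebook, a character in a project.
INSERT INTO CONTAINERS (ID) VALUES (1), (2);
INSERT INTO NOTEBOOKS VALUES (1, 'Short stories');
INSERT INTO PROJECTS  VALUES (2, 'First story');
INSERT INTO ITEMS (ID) VALUES (10), (11);
INSERT INTO SCENES     VALUES (10, 'Harbor at dawn');
INSERT INTO CHARACTERS VALUES (11, 'Ada');
INSERT INTO ITEM_CONTAINER VALUES (10, 1), (11, 2);
""")

links = conn.execute(
    "SELECT ITEM, CONTAINER FROM ITEM_CONTAINER ORDER BY ITEM"
).fetchall()
```

This sketch leaves things out: nothing here stops the same ID from appearing in both NOTEBOOKS and PROJECTS, and the project-to-notebook link is omitted. Making the subtypes mutually exclusive takes extra machinery.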
It'd be easy to put this kind of logic in a stored procedure or the application logic, but I'd really like to keep it in a constraint of some kind if at all possible, just to eliminate any possibility that the logic got bypassed somehow.
Your instincts are correct - make the database "defend" itself from bad data as much as possible. Here is the order of preference for ensuring the correctness of your data:
The structure of tables themselves.
The declarative database constraints (integrity of domain, integrity of key and referential integrity).
Triggers and stored procedures.
Middle tier.
Client.
For example, if you can ensure a certain logic must be followed just by using declarative database constraints, don't put it in triggers.

What would you do to avoid conflicting data in this database schema?

I'm working on a multi-user internet database-driven website with SQL Server 2008 / LinqToSQL / custom-made repositories as the DAL. I have run across a normalization problem which can lead to an inconsistent database state if exploited correctly and I am wondering how to deal with the problem.
The problem: Several different companies have access to my website. They should be able to track their Projects and Clients at my website. Some (but not all) of the projects should be assignable to clients.
This results in the following database schema:
**Companies:**
ID
CompanyName
**Clients:**
ID
CompanyID (not nullable)
FirstName
LastName
**Projects:**
ID
CompanyID (not nullable)
ClientID (nullable)
ProjectName
This leads to the following relationships:
Companies-Clients (1:n)
Companies-Projects (1:n)
Clients-Projects(1:n)
Now, if a user is malicious, he might for example insert a Project with his own CompanyID, but with a ClientID belonging to another user, leaving the database in an inconsistent state.
The problem occurs in a similar fashion all over my database schema, so I'd like to solve this in a generic way if at all possible. I had the following two ideas:
Check for database writes that might lead to inconsistencies in the DAL. This would be generic, but requires some additional database queries before update and create queries are performed, so it will reduce performance.
Create an additional table for the Clients-Projects relationship and make sure the relationships created this way are consistent. This also requires some additional select queries, but far fewer than in the first case. On the other hand it is not generic, so it is easier to miss something in the long run, especially when adding more tables / dependencies to the database.
What would you do? Is there any better solution I missed?
Edit: You might wonder why the Projects table has a CompanyID. This is because I want users to be able to add projects with and without clients. I need to keep track of which company (and therefore which website user) a clientless project belongs to, which is why a project needs a CompanyID.
I'd go with the latter, having one or more tables that define the allowable relationships between entities.
Note, there's no circularity in the references you have, so the title is misleading.
What you have is the possibility of conflicting data, that's different.
Why do you have "CompanyID" in the project table? The ID of the company involved is implicitly given by the client you link to. You don't need it.
Remove that column and you've removed your problem.
Additionally, what is the purpose of the "name" column in the client table? Can you have a client with one name, differing from the name of the company?
Or is "client" the person at that company?
Edit: Ok with the clarification about projects without companies, I would separate out the references, but you're not going to get rid of the problem you're describing without constraints that prevent multiple references being made.
A simple constraint for your existing tables would be that not both the CompanyID and ClientID fields of the project row could be non-null at the same time.
If you want to use the table like this and avoid all the new queries, just put triggers on the table; when a user tries to insert a row with wrong data, the trigger will stop him.
Best Regards,
Iordan
My first thought would be to create a special client record for each company with the name "No client". Then eliminate the CompanyId from the Project table, and if a project has no client, use the "No client" record rather than a "normal" client record. If processing of such no-client records is special, add a flag to the no-client record to explicitly identify it. (I'd hate to rely on the name being "No Client" or something like that -- too fuzzy.)
Then there would be no way to store inconsistent data so the problem would go away.
In the end I implemented a completely generic solution which solves my problem without much runtime overhead and without requiring any changes to the database. I'll describe it here in case someone else has the same problem.
First off, the approach only works because the only table that other tables are referencing through multiple paths is the Companies table. Since this is the case in my database, I only have to check whether all n:1 referenced entities of each entity that is to be created / updated / deleted are referencing the same company (or no company at all).
I am enforcing this by deriving all of my Linq entities from one of the following types:
SingleReferenceEntityBase - The norm. Only checks (via reflection) if there really is only one reference (no matter if transitive or intransitive) to the Companies table. If this is the case, the references to the companies table cannot become inconsistent.
MultiReferenceEntityBase - For special cases such as the Projects table above. Asks all directly referenced entities what company ID they are referencing. Raises an exception if there is an inconsistency. This costs me a few select queries per CRUD operation, but since MultiReferenceEntities are much rarer than SingleReferenceEntities, this is negligible.
Both of these types implement a "CheckReferences" method, which I call whenever the Linq entity is written to the database, by partially implementing the OnValidate(System.Data.Linq.ChangeAction action) method that is automatically generated for all Linq entities.
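The original implementation is C# on LinqToSQL and is not shown, so the following is only a loose Python analogue of the multi-reference check, not the author's code. The class names, fields and the exception type are all invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


class InconsistentCompanyError(Exception):
    """Raised when an entity's references point at different companies."""


@dataclass
class Client:
    id: int
    company_id: int


@dataclass
class Project:
    id: int
    company_id: int
    client: Optional[Client] = None

    def check_references(self) -> None:
        # Collect the company ID reachable through every n:1 reference.
        company_ids = {self.company_id}
        if self.client is not None:
            company_ids.add(self.client.company_id)
        # More than one distinct company means the row would be inconsistent.
        if len(company_ids) > 1:
            raise InconsistentCompanyError(
                f"project {self.id} references companies {sorted(company_ids)}"
            )


# Same company on both paths: the check passes silently.
ok = Project(1, company_id=5, client=Client(1, company_id=5))
ok.check_references()

# Client belongs to another company: the check raises before any write.
bad = Project(2, company_id=5, client=Client(2, company_id=6))
try:
    bad.check_references()
    raised = False
except InconsistentCompanyError:
    raised = True
```

In the approach described above, a call like `check_references()` would sit in the validation hook that runs on every insert, update and delete, so that no write path can skip it.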

Do 1 to 1 relations on db tables smell?

I have a table that has a bunch of fields. The fields can be broken into logical groups - like a job's project manager info. The groupings themselves aren't really entity candidates as they don't and shouldn't have their own PKs.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Is there anything I should watch out for when I do this? Is this just a poor choice?
I can see that maybe my queries will get more complicated with all the extra joins but that can be mitigated with views right? If we're talking about a table with less than 100k records is this going to have a noticeable effect on performance?
Edit: I'll justify the non-entity candidate thoughts a little further. This information is entered by our user base. They don't know/care about each other. So it's possible that the same user will submit the same "projectManager name" or whatever which, at this point, wouldn't be violating any constraint. It's up to us to determine later on down the pipeline if we wanna correlate entries from separate users. If I were to give these things their own key they would grow at the same rate the main table grows - since they are essentially part of the same entity. At no point is a user picking from a list of available "project managers".
So, given the above, I don't think they are entities. But maybe not - if you have further thoughts please post.
I don't usually use 1 to 1 relations unless there is a specific performance reason for it. For example storing an infrequently used large text or BLOB type field in a separate table.
I would suspect that there is something else going on here though. In the example you give - PmFirstName - it seems like maybe there should be a single pm_id relating to a "ProjectManagers" or "Employees" table. Are you sure none of those groupings are really entity candidates?
To me, they smell unless for some rows or queries you won't be interested in the extra columns. e.g. if for a large portion of your queries you are not selecting the PmFirstName columns, or if for a large subset of rows those columns are NULL.
I like the smells tag.
I use 1 to 1 relationships for inheritance-like constructs.
For example, all bonds have some basic information like CUSIP, Coupon, DatedDate, and MaturityDate. This all goes in the main table.
Now each type of bond (Treasury, Corporate, Muni, Agency, etc.) also has its own set of columns unique to it.
In the past we would just have one incredibly wide table with all that information. Now we break out the type-specific info into separate tables, which gives us much better performance.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Create a person table, every database needs this. Then in your project table have a column called PMKey which points to the person table.
Why do you feel that the groups of fields are not entity candidates? If they are not, then why try to identify them with a prefix?
Either drop the prefixes or extract them into their own table.
It is valuable splitting them up into separate tables if they are separate logical entities that could be used elsewhere.
So a "Project Manager" could be 1:1 with all the projects currently, but it makes sense that later you might want to be able to have a Project Manager have more than one project.
So having the extra table is good.
If you have PrimaryFirstName, PrimaryLastName, PrimaryPhone, SecondaryFirstName, SecondaryLastName, SecondaryPhone
You could just have a "Person" table with FirstName, LastName, Phone
Then your original Table only needs "PrimaryId" and "SecondaryId" columns to replace the 6 columns you previously had.
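A quick sketch of that refactoring in Python with an in-memory SQLite database. "Job" stands in for the unnamed original table, and PersonId, JobId and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (
    PersonId  INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName  TEXT NOT NULL,
    Phone     TEXT
);
CREATE TABLE Job (                 -- hypothetical name for the "original table"
    JobId       INTEGER PRIMARY KEY,
    PrimaryId   INTEGER NOT NULL REFERENCES Person(PersonId),
    SecondaryId INTEGER REFERENCES Person(PersonId)
);
-- Hypothetical sample data.
INSERT INTO Person VALUES (1, 'Ann', 'Lee', '555-0100'), (2, 'Bo', 'Diaz', NULL);
INSERT INTO Job VALUES (1, 1, 2);
""")

# Join Person twice, once per role, to get the six original columns back.
row = conn.execute(
    """
    SELECT p.FirstName, p.LastName, p.Phone,
           s.FirstName, s.LastName, s.Phone
    FROM Job j
    JOIN Person p ON p.PersonId = j.PrimaryId
    LEFT JOIN Person s ON s.PersonId = j.SecondaryId
    """
).fetchone()
```

A view over that join would give existing queries the old six-column shape without storing the columns twice.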
Also, using SQL you can split up filegroups and tables across physical locations.
So you could have a POST table, and a COMMENT Table, that have a 1:1 relationship, but the COMMENT table is located on a different filegroup, and on a different physical drive with more memory.
1:1 does not always smell. Unless it has no purpose.
