What would you do to avoid conflicting data in this database schema?

I'm working on a multi-user internet database-driven website with SQL Server 2008 / LinqToSQL / custom-made repositories as the DAL. I have run across a normalization problem which can lead to an inconsistent database state if exploited correctly and I am wondering how to deal with the problem.
The problem: Several different companies have access to my website. They should be able to track their Projects and Clients at my website. Some (but not all) of the projects should be assignable to clients.
This results in the following database schema:
**Companies:**
ID
CompanyName
**Clients:**
ID
CompanyID (not nullable)
FirstName
LastName
**Projects:**
ID
CompanyID (not nullable)
ClientID (nullable)
ProjectName
This leads to the following relationships:
Companies-Clients (1:n)
Companies-Projects (1:n)
Clients-Projects (1:n)
Now, if a user is malicious, he might for example insert a Project with his own CompanyID, but with a ClientID belonging to another user, leaving the database in an inconsistent state.
The problem occurs in a similar fashion all over my database schema, so I'd like to solve it in a generic way if at all possible. I had the following two ideas:
Check in the DAL for database writes that might lead to inconsistencies. This would be generic, but it requires some additional database queries before each update or create query is performed, so performance will suffer.
Create an additional table for the Clients-Projects relationship and make sure the relationships created this way are consistent. This also requires some additional select queries, but far fewer than in the first case. On the other hand, it is not generic, so it is easier to miss something in the long run, especially when adding more tables / dependencies to the database.
What would you do? Is there any better solution I missed?
Edit: You might wonder why the Projects table has a CompanyID. This is because I want users to be able to add projects with and without clients. I need to keep track of which company (and therefore which website user) a clientless project belongs to, which is why a project needs a CompanyID.

I'd go with the latter, having one or more tables that define the allowable relationships between entities.

Note, there's no circularity in the references you have, so the title is misleading.
What you have is the possibility of conflicting data, that's different.
Why do you have "CompanyID" in the project table? The ID of the company involved is implicitly given by the client you link to. You don't need it.
Remove that column and you've removed your problem.
Additionally, what is the purpose of the "name" column in the client table? Can you have a client with one name, differing from the name of the company?
Or is "client" the person at that company?
Edit: OK, with the clarification about projects without clients, I would separate out the references, but you're not going to get rid of the problem you're describing without constraints that prevent multiple references from being made.
A simple constraint for your existing tables would be that the CompanyID and ClientID fields of a project row cannot both be non-null at the same time.
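A minimal sketch of that "not both non-null" rule, written with Python's sqlite3 module as a runnable stand-in for SQL Server. It assumes CompanyID is made nullable (it is not nullable in the question's schema), and the helper name try_insert is hypothetical.

```python
import sqlite3

# Simplified Projects table. Assumption: CompanyID is nullable here,
# unlike the schema in the question.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Projects (
        ID          INTEGER PRIMARY KEY,
        CompanyID   INTEGER,
        ClientID    INTEGER,
        ProjectName TEXT NOT NULL,
        -- At most one of the two references may be set on a row.
        CHECK (CompanyID IS NULL OR ClientID IS NULL)
    )
""")

def try_insert(company_id, client_id, name):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute(
            "INSERT INTO Projects (CompanyID, ClientID, ProjectName) VALUES (?, ?, ?)",
            (company_id, client_id, name),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this rule a project with a client gets its company through the client row, and only a clientless project stores a CompanyID directly.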

If you want to keep using the tables like this and avoid all the new queries, just put triggers on the table; when a user tries to insert a row with wrong data, the trigger will stop him.
Best Regards,
Iordan
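A rough sketch of such a trigger, using Python's sqlite3 module so it can be run as-is. The trigger name and the helper insert_project are hypothetical, and SQL Server trigger syntax differs (there the check would be written against the `inserted` pseudo-table), so treat this only as an illustration of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Companies (ID INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE Clients (
        ID        INTEGER PRIMARY KEY,
        CompanyID INTEGER NOT NULL REFERENCES Companies(ID),
        FirstName TEXT,
        LastName  TEXT
    );
    CREATE TABLE Projects (
        ID          INTEGER PRIMARY KEY,
        CompanyID   INTEGER NOT NULL REFERENCES Companies(ID),
        ClientID    INTEGER REFERENCES Clients(ID),
        ProjectName TEXT
    );

    -- Hypothetical trigger: reject a project whose client belongs to
    -- a different company than the project itself.
    CREATE TRIGGER trg_projects_same_company
    BEFORE INSERT ON Projects
    WHEN NEW.ClientID IS NOT NULL
    BEGIN
        SELECT RAISE(ABORT, 'client belongs to a different company')
        WHERE NOT EXISTS (
            SELECT 1 FROM Clients
            WHERE ID = NEW.ClientID AND CompanyID = NEW.CompanyID
        );
    END;

    INSERT INTO Companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO Clients VALUES (10, 1, 'Ann', 'Lee'), (20, 2, 'Bob', 'Ray');
""")

def insert_project(company_id, client_id, name):
    """Return True if the insert succeeds, False if the trigger aborts it."""
    try:
        conn.execute(
            "INSERT INTO Projects (CompanyID, ClientID, ProjectName) VALUES (?, ?, ?)",
            (company_id, client_id, name),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

A complete version would need a matching trigger for updates, since changing CompanyID or ClientID on an existing row can break the rule just as an insert can.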

My first thought would be to create a special client record for each company with the name "No client". Then eliminate the CompanyId from the Project table, and if a project has no client, use the "No client" record rather than a "normal" client record. If processing of such no-client records is special, add a flag to the no-client record to explicitly identify it. (I'd hate to rely on the name being "No Client" or something like that -- too fuzzy.)
Then there would be no way to store inconsistent data so the problem would go away.
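A minimal sketch of this layout, again using Python's sqlite3 module as a stand-in. The flag column IsNoClient and the two helper functions are assumed names, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Companies (ID INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE Clients (
        ID         INTEGER PRIMARY KEY,
        CompanyID  INTEGER NOT NULL REFERENCES Companies(ID),
        LastName   TEXT,
        IsNoClient INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
    );
    -- Projects no longer stores CompanyID; the company follows from the client.
    CREATE TABLE Projects (
        ID          INTEGER PRIMARY KEY,
        ClientID    INTEGER NOT NULL REFERENCES Clients(ID),
        ProjectName TEXT
    );
    INSERT INTO Companies VALUES (1, 'Acme');
    INSERT INTO Clients VALUES (100, 1, 'No client', 1), (101, 1, 'Lee', 0);
    INSERT INTO Projects VALUES (1, 100, 'Internal tool'), (2, 101, 'Website');
""")

def company_of_project(project_id):
    """Derive the owning company through the client record."""
    row = conn.execute(
        "SELECT c.CompanyID FROM Projects p JOIN Clients c ON c.ID = p.ClientID "
        "WHERE p.ID = ?",
        (project_id,),
    ).fetchone()
    return row[0] if row else None

def clientless_projects(company_id):
    """List project names that hang off the company's placeholder client."""
    rows = conn.execute(
        "SELECT p.ProjectName FROM Projects p JOIN Clients c ON c.ID = p.ClientID "
        "WHERE c.CompanyID = ? AND c.IsNoClient = 1 ORDER BY p.ID",
        (company_id,),
    ).fetchall()
    return [r[0] for r in rows]
```

Because each project reaches its company through exactly one path, there is no second reference that could disagree with the first.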

In the end I implemented a completely generic solution which solves my problem without much runtime overhead and without requiring any changes to the database. I'll describe it here in case someone else has the same problem.
First off, the approach only works because the only table that other tables are referencing through multiple paths is the Companies table. Since this is the case in my database, I only have to check whether all n:1 referenced entities of each entity that is to be created / updated / deleted are referencing the same company (or no company at all).
I am enforcing this by deriving all of my Linq entities from one of the following types:
SingleReferenceEntityBase - The norm. Only checks (via reflection) if there really is only one reference (no matter if transitive or intransitive) to the Companies table. If this is the case, the references to the companies table cannot become inconsistent.
MultiReferenceEntityBase - For special cases such as the Projects table above. Asks all directly referenced entities what company ID they are referencing. Raises an exception if there is an inconsistency. This costs me a few select queries per CRUD operation, but since MultiReferenceEntities are much rarer than SingleReferenceEntities, this is negligible.
Both of these types implement a "CheckReferences" and I am calling it whenever the linq entity is written to the database by partially implementing the OnValidate(System.Data.Linq.ChangeAction action) method which is automatically generated for all Linq entities.
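The core of the multi-reference check can be illustrated with a short Python sketch. This is not the author's LINQ to SQL code; the function and exception names are hypothetical, and the company IDs are assumed to have been collected already from the directly referenced entities.

```python
class InconsistentCompanyError(Exception):
    """Raised when referenced entities point to different companies."""

def check_references(referenced_company_ids):
    """Validate that all referenced entities agree on one company.

    Each item is the company ID reported by one directly referenced
    entity, or None if that entity references no company at all.
    Returns the shared company ID (or None if nothing references one).
    """
    distinct = {cid for cid in referenced_company_ids if cid is not None}
    if len(distinct) > 1:
        raise InconsistentCompanyError(
            f"references point to several companies: {sorted(distinct)}"
        )
    return next(iter(distinct), None)
```

In the approach described above, a check like this would run from the validation hook before each write, so an inconsistent entity never reaches the database.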

Related

Need Help Creating Primary Key/Foreign Key Relationships between Multiple Tables

Background:
(I'm using Microsoft SQL Server 2014)
My company receives data files (tblFile) that contain many accounts (tblAccount). For each data file, we may perform multiple "pricings" (tblPricing), and these "pricings" may contain all of the accounts in the file, or only a subset of them, but the "pricings" cannot contain any accounts not in the file from which the pricing is based. So, in summary:
We get a single file
This file can contain many accounts
We create many pricings from this single file
Each pricing can contain all or some of the accounts from the file it is linked to, but no accounts not in that file
Here is a (way) simplified database diagram as it exists today:
Problem:
What's working so far:
The 1:Many relationship between tblFile and tblPricing
The Many:Many relationship between tblFile and tblAccount (an account can exist in multiple files)
The Many:Many relationship between tblPricing and tblAccount (because many pricings can be performed, an account can exist in many pricings)
Our problem comes from trying to enforce integrity between the subset of accounts that a file has and the subset of accounts that a pricing has. With the above structure, tblPricingAccounts can contain accounts not contained in tblFileAccounts, violating our need for each pricing to contain only accounts from the file it is based on.
I've tried changing the foreign key relationships where I broke the link between tblPricingAccounts and tblAccount, removed 'acct_id' from tblPricingAccounts, and instead linked tblPricingAccounts to tblFileAccounts (yes, I know I need a primary key in tblFileAccounts and I had one). But, then I was able to insert whatever 'pricing_id' I wanted into tblPricingAccounts. Now I could link accounts to a pricing that had nothing to do with the file that originally contained those accounts.
Need
At the end of the day, I don't care what the structure or relationships of my database look like. I simply need the following criteria met, and I can't seem to wrap my mind around it:
A file contains many accounts.
A file contains many pricings.
A pricing contains many accounts, but those accounts MUST be contained in the file that the pricing is linked to.
Any help is appreciated, and I'm open to all suggestions that can be performed within SQL Server. Ultimately I'm building a web application around this database, and I'm using Entity Framework 6 to make life easier (mostly...). I could obviously enforce the above 3 needs through my code, but I really would like the database to be the last line of defense in enforcing this integrity.
This is a situation that foreign key constraints are not intended to handle. FK constraints test for existence of values between tables; they do not enforce particular cardinality requirements.
Simple cardinality is the "one to many", "many to many" relationships mentioned in the question. Your more complex need is still essentially about cardinality though: it's a requirement that certain subsets of rows relate to certain other subsets of rows in a particular way. "Windowed cardinality" if you will. (My own coinage as far as I know.)
As suggested in comments to the question, one way to enforce this wholly within the database is via triggers. A well-crafted trigger in this case would probably test whether the new rows to be inserted are valid, and raise an error without inserting if they are not. For a bulk insert, you may wish to insert the valid rows and reject the rest, or reject everything if one or more rows are invalid. You can also craft logic to handle updates or deletions that could break your integrity requirements.
Be aware that triggers will negatively affect performance, especially if the table is being changed frequently.
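A sketch of such a trigger, written against SQLite through Python's sqlite3 module so that it runs without a server. The column name file_id, the trigger name, and the helper function are assumptions; in SQL Server the same test would be written set-based against the `inserted` pseudo-table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblFile (file_id INTEGER PRIMARY KEY);
    CREATE TABLE tblAccount (acct_id INTEGER PRIMARY KEY);
    CREATE TABLE tblPricing (
        pricing_id INTEGER PRIMARY KEY,
        file_id    INTEGER NOT NULL REFERENCES tblFile(file_id)
    );
    CREATE TABLE tblFileAccounts (
        file_id INTEGER NOT NULL REFERENCES tblFile(file_id),
        acct_id INTEGER NOT NULL REFERENCES tblAccount(acct_id),
        PRIMARY KEY (file_id, acct_id)
    );
    CREATE TABLE tblPricingAccounts (
        pricing_id INTEGER NOT NULL REFERENCES tblPricing(pricing_id),
        acct_id    INTEGER NOT NULL REFERENCES tblAccount(acct_id),
        PRIMARY KEY (pricing_id, acct_id)
    );

    -- Hypothetical trigger: the account must appear in the file
    -- that the pricing is based on.
    CREATE TRIGGER trg_pricing_account_in_file
    BEFORE INSERT ON tblPricingAccounts
    BEGIN
        SELECT RAISE(ABORT, 'account is not in the file this pricing is based on')
        WHERE NOT EXISTS (
            SELECT 1
            FROM tblPricing p
            JOIN tblFileAccounts fa ON fa.file_id = p.file_id
            WHERE p.pricing_id = NEW.pricing_id AND fa.acct_id = NEW.acct_id
        );
    END;

    INSERT INTO tblFile VALUES (1), (2);
    INSERT INTO tblAccount VALUES (10), (20);
    INSERT INTO tblFileAccounts VALUES (1, 10), (2, 20);
    INSERT INTO tblPricing VALUES (100, 1);
""")

def add_account_to_pricing(pricing_id, acct_id):
    """Return True if the row is inserted, False if the trigger rejects it."""
    try:
        conn.execute(
            "INSERT INTO tblPricingAccounts (pricing_id, acct_id) VALUES (?, ?)",
            (pricing_id, acct_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

This covers inserts only; updates to tblPricingAccounts and deletions from tblFileAccounts would need their own triggers to keep the rule airtight.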
Other approaches are to handle this in application logic, as suggested, and/or allow data into the tables regardless, but validate existing data periodically. For example, a nightly process could identify data failing this requirement and pass to a human to correct.
It sounds like tblFileAccounts might be superfluous. Try removing it altogether and inferring which accounts exist in which files through the relationships captured in tblPricingAccounts and tblPricing.
If this meets your need, and there are no attributes (columns) which rightfully belong to the tblFileAccounts object (table), then I think your problem is solved.

SQLite: Individual tables per user or one table for them all?

I've already designed a website which uses an SQLite database. Instead of using one large table, I've designed it so that when a user signs up, an individual table is created for them. Each user will possibly use several hundred records. I did this because I thought it would be easier to structure and access.
I found on other questions on this site that one table is better than using many tables for each user.
Would it be worth redesigning my site so that instead of having many tables, there would be one large table? The current method of mine seems to work well though it is still in development so I'm not sure how well it would stack up in a real environment.
The question is: Would changing the code so that there is one large table instead of many individual ones be worth it in terms of performance, efficiency, organisation and space?
SQLite: Creating a user's table.
"CREATE TABLE " + name + " (id INTEGER PRIMARY KEY, subject TEXT, topic TEXT, questionNumber INTEGER, question TEXT, answer TEXT, color TEXT)"
SQLite: Adding an account to the accounts table.
"INSERT INTO accounts (name, email, password, activated) VALUES (?,?,?,?)", (name, email, password, activated,)
Please note that I'm using python with Flask if it makes any difference.
EDIT
I am also aware that there are questions like this already, however none state whether the advantages or disadvantages will be worth it.
In an object oriented language, would you make a class for every user? Or would you have an instance of a class for each user?
Having one table per user is a really bad design.
You can't search messages based on any field that isn't the username. With your current solution, how would you find all messages for a certain questionNumber?
You can't join with the messages tables. You have to make two queries, one to find the table name and one to actually query the table, which requires two round-trips to the database server.
Each user now has their own table schema. On an upgrade, you have to apply your schema migration to every messages table, and God help you if some of the tables are inconsistent with the rest.
It's effectively impossible to have foreign keys pointing to your messages table. You can't specify the table that the foreign key column points to, because it won't be the same.
You can have name conflicts with your current setup. What if someone registers with the username accounts? Admittedly, this is easy to fix by adding a user_ prefix, but still something to keep in mind.
SQL injection vulnerabilities. What if I register a user named lol; DROP TABLE accounts; --? Query parameters, the primary way of preventing such attacks, don't work on table names.
I could go on.
Please merge all of the tables, and read up on database normalization.
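As a rough illustration of the merged layout, here is a sketch using Python's sqlite3 module. The table name questions, the account_id column, and the helper functions are assumptions; the remaining columns follow the question's CREATE TABLE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id        INTEGER PRIMARY KEY,
        name      TEXT UNIQUE,
        email     TEXT,
        password  TEXT,
        activated INTEGER
    );
    -- One shared table; each row records which account owns it.
    CREATE TABLE questions (
        id             INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL REFERENCES accounts(id),
        subject        TEXT,
        topic          TEXT,
        questionNumber INTEGER,
        question       TEXT,
        answer         TEXT,
        color          TEXT
    );
    CREATE INDEX idx_questions_account ON questions(account_id);
""")

def add_account(name, email, password, activated):
    """Insert an account; the user name is only ever passed as a parameter."""
    cur = conn.execute(
        "INSERT INTO accounts (name, email, password, activated) VALUES (?,?,?,?)",
        (name, email, password, activated),
    )
    return cur.lastrowid

def add_question(account_id, subject, question_number, question, answer):
    conn.execute(
        "INSERT INTO questions (account_id, subject, questionNumber, question, answer) "
        "VALUES (?,?,?,?,?)",
        (account_id, subject, question_number, question, answer),
    )

def questions_for(account_id):
    """All rows of one user, which replaces querying a per-user table."""
    return conn.execute(
        "SELECT questionNumber, question, answer FROM questions "
        "WHERE account_id = ? ORDER BY questionNumber",
        (account_id,),
    ).fetchall()

def questions_numbered(number):
    """A cross-user search, awkward to do with one table per user."""
    return conn.execute(
        "SELECT account_id, question FROM questions WHERE questionNumber = ?",
        (number,),
    ).fetchall()
```

Since the user name never becomes part of the SQL text, a hostile name is stored as ordinary data, and a schema change touches a single table.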

Advanced user info in database

I'm creating an Account table in my project's database. Each account has A LOT of properties:
login
email
password
birthday
country
avatarUrl
city
etc.
Most of them are nullable. My question is, how should I design this in database?
Should it be one table with all those properties? Or maybe should I create two tables, like AccountSet, and AccountInfoSet, where I would store all those 'advanced' user's settings? And last, but not least: if this should be two tables, what kind of relation should be between those tables?
If this is a relational database, then I definitely would not store those properties as fields in the Account table. Some reasons why:
Once your application goes to production (or maybe it's already there), the schema maintenance will become a nightmare. You will absolutely add more properties and having to constantly touch that table in production will be painful.
You will most likely end up with orphaned fields. I've seen this many times where you'll introduce a property and then stop using it, but it's baked into your schema and you might be too scared to remove it.
Ideally you want to avoid having such sparse data in a table (lots of fields with lots of nulls).
My suggestion would be to do what you're already thinking about and that's to introduce a property table for Accounts. You called it AccountInfoSet.
The table should look like this:
AccountId int,
Property nvarchar(50),
Value nvarchar(50)
(Of course you'll set the data types and sizes as you see fit.)
Then you'll join to the AccountInfoSet table and maybe pivot on the "advanced" properties - turn the rows into columns with a query.
In .NET you can also write a stored procedure that returns two queries with one call and look at the tables in the DataSet object.
Or you could just make two separate calls. One for Account and one for the properties.
Lots of ways to get the information out, but make sure you don't just add fields to Account if you're using a relational database.
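The join-and-pivot step could look roughly like the following sketch, which uses Python's sqlite3 module and conditional aggregation rather than a vendor-specific PIVOT clause. The Account columns, the sample rows, and the function name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Account (
        AccountId INTEGER PRIMARY KEY,
        login     TEXT NOT NULL,
        email     TEXT NOT NULL
    );
    CREATE TABLE AccountInfoSet (
        AccountId INTEGER NOT NULL REFERENCES Account(AccountId),
        Property  TEXT NOT NULL,
        Value     TEXT,
        PRIMARY KEY (AccountId, Property)
    );
    INSERT INTO Account VALUES
        (1, 'ann', 'ann@example.com'),
        (2, 'bob', 'bob@example.com');
    INSERT INTO AccountInfoSet VALUES
        (1, 'country', 'PL'),
        (1, 'city', 'Gdansk'),
        (2, 'country', 'US');
""")

# Turn property rows back into columns, one CASE expression per property.
PIVOT_SQL = """
    SELECT a.AccountId,
           a.login,
           MAX(CASE WHEN i.Property = 'country' THEN i.Value END) AS country,
           MAX(CASE WHEN i.Property = 'city'    THEN i.Value END) AS city
    FROM Account a
    LEFT JOIN AccountInfoSet i ON i.AccountId = a.AccountId
    GROUP BY a.AccountId, a.login
    ORDER BY a.AccountId
"""

def accounts_with_properties():
    """Return one row per account with the selected properties as columns."""
    return conn.execute(PIVOT_SQL).fetchall()
```

Properties that an account never set simply come back as NULL (None in Python), so no sparse columns need to exist in the Account table itself.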

Database Is-a relationship

My problem relates to DB schema developing and is as follows.
I am developing a purchasing module, which I want to use for purchasing both items and services.
Following is my EER diagram, (note that service has very few specialized attributes – max 2)
My question is whether to keep products and services in two tables or in just one table.
One table option –
Reduces complexity as I will only need to specify item id which refers to item table which will have an “item_type” field to identify whether it’s a product or a service
Two table option –
Will have to refer to a separate product or service table everywhere I want to refer to them, and will have to keep an “item_type” field in every table which refers to either a product or a service?
Currently planning to use option 1, but want to know expert opinion on this matter. Highly appreciate your time and advice. Thanks.
I'd certainly go with the "two tables" option. You see, you have to distinguish Products and Services, so you may either use switch(item_type) { ... } in your program or entirely distinct code paths for Product and for Service. And if a need for updating the DB schema arises, switch is harder to maintain.
The second reason is NULLs. I'd advise avoiding them as much as you can: they create more problems than they solve. With two tables you can declare all fields non-NULL and forget about NULL-processing. With the one table option, you have to manually write code to ensure that if item_type=product, then Product-specific fields are not NULL, and Service-specific ones are, and that if item_type=service, then Service-specific fields are not NULL, and Product-specific ones are. That's not quite pleasant work, and the DBMS won't do it for you automatically (there is no NOT NULL IF another_field = value column constraint in SQL or anything like this; at best you would hand-write a table-level CHECK constraint covering every such field).
Go with two tables. It's easier to support. I once saw a DB where everything, every single piece of data went in just two tables — there were pages and pages of code to make sure that necessary fields are not NULL.
If I were implementing this, I would go for the two-table option. It's kind of like the first rule of schema normalization: remove multi-valued attributes. Using item_type is not recommended. Once you create separate tables you don't need item_type; you can just use the foreign key relationship.
Consider reading this article:
http://en.wikipedia.org/wiki/Database_normalization
It should help.

Database design - do I need one of two database fields for this?

I am putting together a schema for a database. The goal of the database is to track applications in our department. I have a repeated problem that I am trying to solve.
For example, I have an "Applications" table. I want to keep track of whether each application uses a database or a bug tracking system, so right now I have the following fields in the Applications table:
Table: Applications
UsesDatabase (bit)
Database_ID (int)
UsesBugTracking (bit)
BugTracking_ID (int)
Table: Databases:
id
name
Table: BugTracking:
id
name
Should I consolidate the "uses" column with the respective ID columns so there is only one bug tracking column and only one database column in the applications table?
Any best practice here for database design?
NOTE: I would like to run reports like "Percent of Application that use bug tracking" (although I guess either approach could generate this data.)
You could remove the "uses" fields and make the id columns nullable, and let a null value mean that it doesn't use the feature. This is a common way of representing a missing value.
Edit:
To answer your note, you can easily get those statistics like this:
    select
        count(*) as TotalApplications,
        count(Database_ID) as UsesDatabase,
        count(BugTracking_ID) as UsesBugTracking
    from
        Applications
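For the "percent of applications" report from the question, a small sketch along the same lines, using Python's sqlite3 module. The id and name columns, the sample rows, and the function name are assumptions; the key point is that count(column) skips NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Applications (
        id             INTEGER PRIMARY KEY,
        name           TEXT,
        Database_ID    INTEGER,   -- NULL means "uses no database"
        BugTracking_ID INTEGER    -- NULL means "uses no bug tracking"
    );
    INSERT INTO Applications VALUES
        (1, 'A', 1,    NULL),
        (2, 'B', NULL, NULL),
        (3, 'C', 2,    7),
        (4, 'D', 1,    7);
""")

def usage_percentages():
    """Return (total, percent using a database, percent using bug tracking)."""
    return conn.execute("""
        SELECT COUNT(*),
               100.0 * COUNT(Database_ID)    / COUNT(*),
               100.0 * COUNT(BugTracking_ID) / COUNT(*)
        FROM Applications
    """).fetchone()
```

With the four sample rows above, three applications use a database and two use bug tracking, so the call returns (4, 75.0, 50.0).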
Why not get rid of the two Use fields and simply let a NULL value in the _ID fields indicate that the record does not use that application (bug tracking or database)?
Either solution works. However, if you think you may want to occasionally just get a list of applications which do / do not have databases / bugtracking consider that having the flag fields reduces the query by one (or two) joins.
Having the bit fields is slightly denormalized, as you have to keep two fields in sync to keep one piece of data updated, but I tend to prefer them for cases like this for the reason I gave in the prior paragraph.
Another option would be to have the field nullable, and put null in it for those entries which do not have DBs / etc, but then you run into problems with foreign key constraints.
I don't think there is any one supreme right way, just consider the tradeoffs and go with what makes sense for your application.
I would use 3 tables for the objects: Application, Database, and BugTracking. Then I would use 2 join tables to do 1-to-many joins: ApplicationDatabases, and ApplicationBugTracking.
The 2 join tables would have both an application_id and the id of the other table. If an application used a single database, it would have a single ApplicationDatabases record joining them together. Using this setup, an application could have 0 database (no records for this app in the ApplicationDatabases table), or many databases (multiple records for this app in the ApplicationDatabases table).
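A minimal sketch of the join-table layout for the database side, using Python's sqlite3 module. The table names follow the question (Applications, Databases) rather than the singular names in this answer, and the sample rows and function name are made up; the bug tracking side would look the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Applications (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Databases (id INTEGER PRIMARY KEY, name TEXT);
    -- Join table: one row per (application, database) pair.
    CREATE TABLE ApplicationDatabases (
        application_id INTEGER NOT NULL REFERENCES Applications(id),
        database_id    INTEGER NOT NULL REFERENCES Databases(id),
        PRIMARY KEY (application_id, database_id)
    );
    INSERT INTO Applications VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Databases VALUES (1, 'Orders'), (2, 'Reporting');
    INSERT INTO ApplicationDatabases VALUES (1, 1), (1, 2), (3, 1);
""")

def databases_per_application():
    """Return (application name, number of databases) for every application."""
    return conn.execute("""
        SELECT a.name, COUNT(ad.database_id)
        FROM Applications a
        LEFT JOIN ApplicationDatabases ad ON ad.application_id = a.id
        GROUP BY a.id, a.name
        ORDER BY a.id
    """).fetchall()
```

An application with no rows in the join table shows up with a count of zero, so neither a "uses" flag nor a nullable ID column is needed.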
"Should i consolidate the "uses" column"
If I look at your problem statement, then there either is no "uses" column at all, or there are two. In either case, it is wrong of you to speak of "THE" uses column.
May I politely suggest that you learn to be PRECISE when asking questions?
Yes using null in the foreign key fields should be fine - it seems superfluous to have the bit fields.
Another way of doing it (though it might be considered evil by database people ^^) is to default them to 0 and add in an ID 0 data row in both bugtrack and database tables with a name of "None"... when you do the reports, you'll have to do some more work unless you present the "None" values as they are as well with a neat percentage...
To answer the edited question-
Yes, the fields should be combined, with NULL meaning that the application doesn't have a database (or bug tracker).
