Data Vault Modelling Foreign Keys - data-modeling

I have a question regarding specific data vault modelling.
I have a source table which captures call center CALL information like this:
CallId (business key)
Date
Call_alert
Call_acw
etc
The same source table also has a bunch of foreign keys in it, like this:
RouteID (on which line the call eventually ends)
ConnectionType (phone, email etc)
Via each foreign key it is possible to retrieve extra information about the key (which is not linked to the CALL).
My question is how to model these foreign keys in my model. Do I keep them as attributes in my satellite or do I model them as links? Or is there any other option I haven't thought about?
Thanks!!

I'll focus on one example you give (RouteID) but the discussion is probably the same for each.
The first thing to remember is that the aim in Data Vault is to model the business and business processes not the system the data is stored in. Foreign keys may be an indication of something meaningful (a linking between two hubs) or they may not (the product of normalisation within the database that you may not need to replicate).
The first step in your case is to think about what RouteID and the data it links through to mean to the business. If route (or the line it represents) is a meaningful concept to the business in its own right then it probably requires its own hub, satellites for data relating to it and then link tables to join it up to your call data.
On the other hand the data might only have meaning as a categorisation of another hub (call in your case), in which case think about de-normalising it into a satellite that connects to the call hub. Remember that you can have multiple satellites connected to one hub; there's nothing preventing you having a call route satellite, a connection type satellite and so on.
You'll need to make this decision for each foreign key and probably end up with different choices for each. Staff member receiving the call for example would almost certainly be a link to another hub as you almost certainly have other data you want to link staff to. Something like the connection type you mention is unlikely to be meaningful in its own right so is more likely to form part of a satellite.
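To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3. It assumes route turns out to be a business concept (so it gets a hub and a link) while connection type is only a categorisation of the call (so it lives in a satellite). All table and column names are made up for illustration, and the hash-key and load-date columns follow a common Data Vault convention rather than anything stated in the question.

```python
import sqlite3

# Hypothetical Data Vault layout; every name here is illustrative.
DDL = """
CREATE TABLE hub_call (
    call_hk    TEXT PRIMARY KEY,      -- hash/surrogate key of the hub
    call_id    TEXT NOT NULL UNIQUE,  -- business key from the source (CallId)
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);
CREATE TABLE hub_route (
    route_hk   TEXT PRIMARY KEY,
    route_id   TEXT NOT NULL UNIQUE,  -- business key from the source (RouteID)
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);
-- Link: route is treated as a business concept in its own right.
CREATE TABLE link_call_route (
    call_route_hk TEXT PRIMARY KEY,
    call_hk    TEXT NOT NULL REFERENCES hub_call(call_hk),
    route_hk   TEXT NOT NULL REFERENCES hub_route(route_hk),
    load_dts   TEXT NOT NULL,
    record_src TEXT NOT NULL
);
-- Satellite: connection type is only a categorisation of the call.
CREATE TABLE sat_call_connection (
    call_hk         TEXT NOT NULL REFERENCES hub_call(call_hk),
    load_dts        TEXT NOT NULL,
    connection_type TEXT,
    record_src      TEXT NOT NULL,
    PRIMARY KEY (call_hk, load_dts)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# List the tables that were created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

If the business later decides that connection type matters in its own right, the satellite can be replaced by a hub plus link in the same way as route.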

Related

Best way to model resource ownership

Let's say I have a number of (less than fifty) entities in my data model and I need to store who owns each entity for security reasons. I need to be able to decide on each request if the user doing a specific action on a resource is allowed to do that (who is doing what on which resource). And for this the resource ownership is needed.
I can think of several different ways to do this. One is that in each table I can have a foreign key pointing to the owner. One downside with this solution is that in code I need to look at each individual table to find out the ownership. Each time a new table is added I would need to update the code to look in that new table.
Another solution could be to treat every specific entity as a generic resource, a resource that has an ownership. And store that ownership in one single table. I could even do that without any foreign key relationship and deal with it in code to keep the resource table in sync and e.g. make sure that each new entry in any table has a corresponding record in the "resource" table. One obvious downside would be that there will be a lot of records in this table. The benefit would be that there is one single place to go to to find the ownership.
So what would be the preferred way? Would there be a performance problem with storing the ownership in one table, given there might be in the range of hundreds of thousands of records (possibly even millions) in it eventually? What about the cost of preserving lots of foreign key constraints? Is there a better way of solving this?
Thanks
You're working in an object oriented language. Inheritance is perfect to solve this problem.
Depending on if you use Code First, or DB First, your approach will be slightly different, but boils down to this:
Make an abstract class; you can call it something like 'OwnableEntity'. In essence, you put your foreign key and navigation property in there.
Inherit all your entities from this 'OwnableEntity'
Make sure that your inheritance mapping in EF is correct (in this case, TPC inheritance mapping is what you will probably want to use)
From now on, you can write your 'check ownership' logic against 'OwnableEntity', and it will be OK for every entity you implement later on.
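The shape of that idea can be sketched outside of EF as well. The following is a plain Python illustration, not Entity Framework code: `Document`, `Invoice`, `owner_id` and `check_ownership` are hypothetical names, and the mapping to tables (TPC or otherwise) is left out entirely.

```python
from abc import ABC
from dataclasses import dataclass


@dataclass
class OwnableEntity(ABC):
    """Common base: every entity that can be owned carries the owner's key."""
    owner_id: int

    def is_owned_by(self, user_id: int) -> bool:
        return self.owner_id == user_id


@dataclass
class Document(OwnableEntity):
    title: str = ""


@dataclass
class Invoice(OwnableEntity):
    amount: float = 0.0


def check_ownership(entity: OwnableEntity, user_id: int) -> bool:
    """Written once against the base type; works for any later subclass."""
    return entity.is_owned_by(user_id)


doc_allowed = check_ownership(Document(owner_id=1, title="spec"), user_id=1)
invoice_allowed = check_ownership(Invoice(owner_id=2, amount=5.0), user_id=1)
```

The point is only that the ownership check depends on the base type, so adding a new entity does not require touching the check.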

How can I get rid of an "indirect" foreign key? Should I?

Here's an example schema to illustrate what I'm talking about:
Let's say I'm storing information about some activities (seminars, trainings, whatever) that are being hosted in a certain set of locations, identified by type (hackerspace, swimming pool, etc) and city. Each activity happens at all of the locations of a suitable type at once (e.g. any programming seminar happens at all of the hackerspaces at once), so any person may choose to attend an activity in any of the suitable locations. Therefore, any activity is associated only with some location type, while an attendance record is associated with some activity (and therefore implicitly with some location type) and the city where this particular user attended the activity.
The most common query in the system by far is generating a report of all activities attended by a given person.
Am I right in feeling that this is ugly? Should I try to redesign this, and if so, how?
P.S. I'd rather not reveal the actual data I'm storing in a database where I had to employ a similar design, so I hope that this analogy makes some sense.
It sounds like you need a LocationTypes table with a list of location types. Then, Location can have a foreign key relationship to LocationTypes.
But, I don't like assuming that the set of locations doesn't change over time. So that is overly simplistic. So, I would have another entity of something like LocationSets, which would list the locations for a given activity over time. The LocationSets would contain the "type" which can be used. The locations associated with a location set would be in another table, a junction table connecting the location sets and the locations.
Then Activities would have a LocationSetId. And Attendance would have a LocationId. You might want to enforce that at any given time, the Attendance location is consistent with the locations in the Activity's LocationSets. This could be done at the application layer, through a trigger in the database, or through a mechanism such as a function-based constraint (if your database supports those).
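A rough sketch of that layout, using SQLite through Python's sqlite3 module and the trigger option for the consistency rule. The table and column names are assumptions, the sample rows are invented, and the "over time" aspect (validity dates on the location sets) is left out to keep it short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE location_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE locations (
    id INTEGER PRIMARY KEY,
    city TEXT NOT NULL,
    location_type_id INTEGER NOT NULL REFERENCES location_types(id)
);
CREATE TABLE location_sets (
    id INTEGER PRIMARY KEY,
    location_type_id INTEGER NOT NULL REFERENCES location_types(id)
);
-- Junction table connecting location sets and locations.
CREATE TABLE location_set_members (
    location_set_id INTEGER NOT NULL REFERENCES location_sets(id),
    location_id INTEGER NOT NULL REFERENCES locations(id),
    PRIMARY KEY (location_set_id, location_id)
);
CREATE TABLE activities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location_set_id INTEGER NOT NULL REFERENCES location_sets(id)
);
CREATE TABLE attendance (
    person TEXT NOT NULL,
    activity_id INTEGER NOT NULL REFERENCES activities(id),
    location_id INTEGER NOT NULL REFERENCES locations(id)
);
-- Reject attendance at a location outside the activity's location set.
CREATE TRIGGER attendance_location_check
BEFORE INSERT ON attendance
WHEN NOT EXISTS (
    SELECT 1 FROM activities a
    JOIN location_set_members m ON m.location_set_id = a.location_set_id
    WHERE a.id = NEW.activity_id AND m.location_id = NEW.location_id
)
BEGIN
    SELECT RAISE(ABORT, 'location is not part of the activity location set');
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript("""
INSERT INTO location_types VALUES (1, 'hackerspace'), (2, 'swimming pool');
INSERT INTO locations VALUES (1, 'Berlin', 1), (2, 'Paris', 1), (3, 'Berlin', 2);
INSERT INTO location_sets VALUES (1, 1);
INSERT INTO location_set_members VALUES (1, 1), (1, 2);
INSERT INTO activities VALUES (1, 'Programming seminar', 1);
""")


def try_attend(person, activity_id, location_id):
    """Return True if the attendance row was accepted, False if the trigger rejected it."""
    try:
        conn.execute(
            "INSERT INTO attendance VALUES (?, ?, ?)",
            (person, activity_id, location_id),
        )
        return True
    except sqlite3.DatabaseError:
        return False


hackerspace_ok = try_attend("alice", 1, 2)  # Paris hackerspace is in the set
pool_ok = try_attend("bob", 1, 3)           # Berlin swimming pool is not
```

The common report (all activities attended by a given person) is then a plain join from attendance to activities, with no need to go through location types.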

Is it insecure to reveal a row's primary key to the user?

Why do many applications replace the primary key of a database with a seemingly random alternative id when revealing the record to the user?
My guess is that it prevents users from guessing other rows in the table. If so, isn't that just a false sense of security?
I guess you are talking about surrogate keys here. One of the desired or supposed advantages of surrogate keys is that they aren't burdened by any external meaning or dependency on anything outside the database. So for example the surrogate key values can safely be reassigned or the key can be refactored or discarded without any consequences for users of the system.
Generally surrogate keys are kept hidden from users so that they don't acquire any such external dependencies. Being hidden from users was in fact part of the original definition of a surrogate key as proposed by E.F.Codd. If key values reside in the user's browser cache or favourites list then they aren't much use as "surrogates" any more. So that's one common reason why you will see one key used only inside the database and a different key for the same table made visible in the application.
I think it may depend on the type of application you are working with. I work with enterprise software that is only used by the company I work for and is not generally available to the outside world. In this case, it is often critical to let the user see the surrogate key for people-related records because the information in the person table has no uniqueness. There can be two John Smiths (we actually have over 1000 of them) who are genuinely different people. They may even have the same business address and be different people (sons are often named for fathers and work in the same medical practice, for instance). So users need to refer to the surrogate key on forms and in reporting to ensure they are using the record they thought they wanted. Otherwise, if they wanted to research further details about the John Smith that they saw in a report, how would they look it up in the application without having to go through all 1000 to find the right one? Creating a fake ID as well as the real one would be time consuming (we import millions of records at a time) and for no real gain, since the data would not be visible outside our company application.
For a web app that is open to the general public, I can see where you might not want to show this information.

Should I expose a user ID to public?

I have a form that reveals user IDs to the public. I was wondering whether this is dangerous. Personally I do not see anything bad about it. The ID is just used to reference a single database record.
If it were dangerous, Stack Overflow wouldn't be displaying user IDs in their URLs in order to make user profile lookups work: https://stackoverflow.com/users/104826/rfactor
Edit of seriousness of immense levels: if user IDs are themselves sensitive data; for example your primary keys for some reason happen to be social security numbers, that'll definitely be a security and privacy liability. If your user IDs are just auto-increment numbers though, you're clear.
Generally it's not a problem, but it can give away hints on how active your site is, like how many users you have etc. Whether you consider this sensitive information or maybe even good marketing is completely up to you.
There's a story that this was one of the reasons the Germans lost WW2. They had sequential serial numbers from production written on each tank. By collecting ID numbers from tanks taken out, the British could estimate how many tanks the whole German army had and make new strategies from that.
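The estimate behind that story is simple enough to sketch. The standard "German tank problem" estimator takes the largest observed ID m and the number of observed IDs k and guesses a total of m + m/k - 1. The snippet below assumes IDs start at 1, have no gaps, and that the sample is drawn at random without repeats; the function name and sample values are just for illustration.

```python
def estimate_total(observed_ids):
    """Estimate the highest ID in use from a sample of sequential IDs.

    Uses the estimator m + m/k - 1, where m is the largest observed ID
    and k is the number of distinct IDs observed.
    """
    ids = set(observed_ids)
    if not ids:
        raise ValueError("need at least one observed ID")
    m, k = max(ids), len(ids)
    return m + m / k - 1


# Four user IDs seen in public URLs.
estimate = estimate_total([19, 40, 42, 60])
```

So anyone who can collect a handful of auto-increment IDs gets a rough count of your records, which is the "hint" mentioned above.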
I have found that exposing primary keys that identify physical entities can create headaches.
Imagine if two blood samples come into a laboratory and test results are generated for each sample. Many different kinds of test might be done and each record representing a test result will have the sample_id as a foreign key.
If you share the database ID with the customer and you discover that two samples were accidentally switched, you will have to update the foreign keys in all the detail records representing the tests. If you instead exposed some other unique name outside your system, you will just need to switch the two unique names on the sample records in the master table.
There are other advantages related to data migration, and further advantages when entities are represented in more than one database, where it is difficult to create records with identical database IDs.
In my experience it is always best to expose a unique identifier other than the primary key outside your system. It gives you more flexibility in resolving data mix-ups, dealing with data migration issues, and in otherwise future-proofing your system.
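Here is a small sketch of the blood sample case, using SQLite via Python's sqlite3. The schema, the `public_code` column and the sample values are all assumptions made for the example. The mix-up is fixed by swapping the two external codes on the sample rows, and none of the detail rows are touched.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (
    id INTEGER PRIMARY KEY,            -- internal key, never shown outside
    public_code TEXT NOT NULL UNIQUE   -- identifier shared with the customer
);
CREATE TABLE test_results (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    test_name TEXT NOT NULL,
    value REAL NOT NULL
);
INSERT INTO samples VALUES (1, 'S-A'), (2, 'S-B');
INSERT INTO test_results (sample_id, test_name, value)
VALUES (1, 'glucose', 5.1), (2, 'glucose', 7.8);
""")


def swap_public_codes(code_a, code_b):
    """Swap the external codes of two samples; test_results rows stay as they are."""
    placeholder = f"swap-{uuid.uuid4()}"  # temporary value to respect UNIQUE
    with conn:  # commit on success, roll back on error
        conn.execute("UPDATE samples SET public_code = ? WHERE public_code = ?",
                     (placeholder, code_a))
        conn.execute("UPDATE samples SET public_code = ? WHERE public_code = ?",
                     (code_a, code_b))
        conn.execute("UPDATE samples SET public_code = ? WHERE public_code = ?",
                     (code_b, placeholder))


def results_for(public_code):
    """Look up test results by the code the customer knows."""
    return conn.execute(
        "SELECT t.test_name, t.value FROM test_results t "
        "JOIN samples s ON s.id = t.sample_id "
        "WHERE s.public_code = ? ORDER BY t.id",
        (public_code,),
    ).fetchall()


before = results_for("S-A")
swap_public_codes("S-A", "S-B")
after = results_for("S-A")
```

Had the customer been given the internal `id` instead, the same fix would have meant updating `sample_id` on every test result of both samples.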
As for me, showing an ID is as dangerous as showing a user name.
Exposing a user ID is not, in and of itself, bad. It depends on the level of privacy and security needed. If the user ID does not expose and cannot be tied to any other personal data that should otherwise be private, it may not be a problem.
But don't think that public user IDs can never be a problem.
Make sure you don't allow anyone to break in to any private data just by knowing user IDs. Facebook has had problems like that. Here's just one example. While revealing user IDs wasn't the whole story, it was part of the equation.
Will it hurt anything? Only you can decide that, and you should think that through.
But in general, it is poor form to display the User ID without having a business reason to do so. (Saves you work is probably not a good business reason.)
If it is a generated database id with no other meaning, it's not dangerous. Though I don't think revealing an id is elegant either. It's a technical detail and I can't understand why you would like to show it to users.

What would you do to avoid conflicting data in this database schema?

I'm working on a multi-user internet database-driven website with SQL Server 2008 / LinqToSQL / custom-made repositories as the DAL. I have run across a normalization problem which can lead to an inconsistent database state if exploited correctly and I am wondering how to deal with the problem.
The problem: Several different companies have access to my website. They should be able to track their Projects and Clients at my website. Some (but not all) of the projects should be assignable to clients.
This results in the following database schema:
**Companies:**
ID
CompanyName
**Clients:**
ID
CompanyID (not nullable)
FirstName
LastName
**Projects:**
ID
CompanyID (not nullable)
ClientID (nullable)
ProjectName
This leads to the following relationships:
Companies-Clients (1:n)
Companies-Projects (1:n)
Clients-Projects (1:n)
Now, if a user is malicious, he might for example insert a Project with his own CompanyID, but with a ClientID belonging to another user, leaving the database in an inconsistent state.
The problem occurs in a similar fashion all over my database schema, so I'd like to solve this in a generic way if at all possible. I had the following two ideas:
Check in the DAL for database writes that might lead to inconsistencies. This would be generic, but requires some additional database queries before update and create queries are performed, so it will reduce performance.
Create an additional table for the clients-Projects relationship and make sure the relationships created this way are consistent. This also requires some additional select queries, but far less than in the first case. On the other hand it is not generic, so it is easier to miss something in the long run, especially when adding more tables / dependencies to the database.
What would you do? Is there any better solution I missed?
Edit: You might wonder why the Projects table has a CompanyID. This is because I want users to be able to add projects with and without clients. I need to keep track of which company (and therefore which website user) a clientless project belongs to, which is why a project needs a CompanyID.
I'd go with the latter, having one or more tables that define the allowable relationships between entities.
Note, there's no circularity in the references you have, so the title is misleading.
What you have is the possibility of conflicting data, that's different.
Why do you have "CompanyID" in the project table? The ID of the company involved is implicitly given by the client you link to. You don't need it.
Remove that column and you've removed your problem.
Additionally, what is the purpose of the "name" column in the client table? Can you have a client with one name, differing from the name of the company?
Or is "client" the person at that company?
Edit: OK, with the clarification about projects without clients, I would separate out the references, but you're not going to get rid of the problem you're describing without constraints that prevent multiple references being made.
A simple constraint for your existing tables would be that the CompanyID and ClientID fields of a project row cannot both be non-null at the same time.
If you want to keep using the table like this and avoid all the new queries, just put triggers on the table: when a user tries to insert a row with wrong data, the trigger will stop him.
Best Regards,
Iordan
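The trigger suggestion above could look roughly like this. The sketch uses SQLite through Python's sqlite3 rather than SQL Server 2008, so the trigger syntax differs from T-SQL, and the lower-case table and column names and the sample rows are assumptions. It only guards inserts; a matching trigger on updates would be needed as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, company_name TEXT NOT NULL);
CREATE TABLE clients (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    first_name TEXT,
    last_name TEXT
);
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    client_id INTEGER REFERENCES clients(id),
    project_name TEXT NOT NULL
);
-- Reject a project whose client belongs to another company.
CREATE TRIGGER projects_same_company_ins
BEFORE INSERT ON projects
WHEN NEW.client_id IS NOT NULL AND NOT EXISTS (
    SELECT 1 FROM clients c
    WHERE c.id = NEW.client_id AND c.company_id = NEW.company_id
)
BEGIN
    SELECT RAISE(ABORT, 'client belongs to a different company');
END;
INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO clients VALUES (1, 1, 'Ann', 'Lee'), (2, 2, 'Bob', 'Ray');
""")


def try_insert_project(company_id, client_id, name):
    """Return True if the project row was accepted, False if the trigger rejected it."""
    try:
        conn.execute(
            "INSERT INTO projects (company_id, client_id, project_name) VALUES (?, ?, ?)",
            (company_id, client_id, name),
        )
        return True
    except sqlite3.DatabaseError:
        return False


own_client = try_insert_project(1, 1, "Website")       # client of company 1
no_client = try_insert_project(1, None, "Internal")    # clientless project
foreign_client = try_insert_project(1, 2, "Sneaky")    # client of company 2
```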
My first thought would be to create a special client record for each company with the name "No client". Then eliminate the CompanyId from the Project table, and if a project has no client, use the "No client" record rather than a "normal" client record. If processing of such no-client records is special, add a flag to the no-client record to explicitly identify it. (I'd hate to rely on the name being "No Client" or something like that; too fuzzy.)
Then there would be no way to store inconsistent data so the problem would go away.
In the end I implemented a completely generic solution which solves my problem without much runtime overhead and without requiring any changes to the database. I'll describe it here in case someone else has the same problem.
First off, the approach only works because the only table that other tables are referencing through multiple paths is the Companies table. Since this is the case in my database, I only have to check whether all n:1 referenced entities of each entity that is to be created / updated / deleted are referencing the same company (or no company at all).
I am enforcing this by deriving all of my Linq entities from one of the following types:
SingleReferenceEntityBase - The norm. Only checks (via reflection) if there really is only one reference (no matter if transitive or intransitive) to the Companies table. If this is the case, the references to the companies table cannot become inconsistent.
MultiReferenceEntityBase - For special cases such as the Projects table above. Asks all directly referenced entities what company ID they are referencing. Raises an exception if there is an inconsistency. This costs me a few select queries per CRUD operation, but since MultiReferenceEntities are much rarer than SingleReferenceEntities, this is negligible.
Both of these types implement a "CheckReferences" method, and I am calling it whenever the LINQ entity is written to the database by partially implementing the OnValidate(System.Data.Linq.ChangeAction action) method which is automatically generated for all LINQ entities.
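For readers who want the gist of the multi-reference check without LINQ to SQL, here is a simplified Python sketch. It is not the code described above: there is no reflection and no database, the class and method names are invented, and each referenced entity is simply assumed to be able to report its company ID.

```python
class InconsistentCompanyError(Exception):
    """Raised when an entity's references point at different companies."""


class Company:
    def __init__(self, company_id):
        self.id = company_id

    def company_id(self):
        return self.id


class Client:
    def __init__(self, company):
        self.company = company

    def company_id(self):
        return self.company.company_id()


class MultiReferenceEntity:
    """Base for entities that can reach Companies through several paths."""

    def referenced_entities(self):
        raise NotImplementedError

    def check_references(self):
        # Ask every directly referenced entity which company it belongs to.
        ids = {e.company_id() for e in self.referenced_entities() if e is not None}
        if len(ids) > 1:
            raise InconsistentCompanyError(f"conflicting company IDs: {sorted(ids)}")


class Project(MultiReferenceEntity):
    def __init__(self, company, client=None):
        self.company = company
        self.client = client

    def referenced_entities(self):
        return [self.company, self.client]


def is_consistent(entity):
    """Run the check the way a pre-save validation hook would."""
    try:
        entity.check_references()
        return True
    except InconsistentCompanyError:
        return False


acme, globex = Company(1), Company(2)
ok_project = is_consistent(Project(acme, Client(acme)))
clientless = is_consistent(Project(acme))
bad_project = is_consistent(Project(acme, Client(globex)))
```

In the real setup the equivalent of `check_references` would run from the validation hook before each write, which is where the extra select queries come from.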
