Extendable database schema for contacts (social)

Extendable database schema for contacts (social) - sql-server

I have an old application that needs upgrading. Doesn't everything now days?
The existing DB schema consists of predefined fields like phone, fax, email. Obviously with the social explosion over the last 5-7 years (or longer depending on your country) end users need more control over creating contact cards the way they see fit rather than just what I think might be useful.
Im concerned here with "digital" addresses. i.e. One line type addresses. phone=ccc ccc ccc ccc etc
Since physical addresses are pretty standard in terms of requirements in this case users will have to use what they are given (location, postal, delivery) in order to keep the scope managable.
So I'm wondering what the best practice format for storing digital info is. To me it seems I have two choices:
A simple 4 field table (ContactId, AddressTypeId, Address, FormatterId)
1000, "phone", "ccc ccc ccc ccc", phoneformatter
1000, "facebook", "myfacebook", facebookformatter
This would then be JOINED anywhere it's need. The table would get massive though and the join performance would degrade over time i suspect.
A json blob that would require additional processing once read (ContactId, Addresses)
1000, {{"phone": "ccc ccc ccc ccc"}, {"facebook": "myfacebook"}}
Or ... something else.
This db is for use in a given country by customers only trading domestically with client bases ranging from 3000-12000 accounts and then however many contacts per account - averages about 10 in current system.
My primary concern is user flexibility but performance is a key consideration in that. So I dunno, just do whatever and throw heaps of hardware at it ;)
Application is in C# if that makes any difference re: post query processing.

I would not go for the JSON blob. This will be nasty if you need to answer any queries like:-
Does anyone have me in their Facebook contacts?
What's the most popular type of social media contact?
You would be forced to parse the JSON for every record and be unable to create a simple index.
Your additional solution is nearly correct, however FormatterId would need to be on a AddressType table. What you have is not normalised as FormatterId would depend only on AddressTypeId. So you would have three tables:-
Contact
ContactAddress
AddressType
You haven't stated if you need to store two addresses of the same type against a single contact. e.g. if someone has two twitter accounts. Answering this question will allow you to define the correct primary key on ContactAddress. It would either be (ContactId, AddressTypeId) if you can only have one of each type per contact or create a synthenic key (ContactAddressId).

Well, I believe you have a table named contact
contact(contactid, contact details, other details)
and now you want to remove this contact details from the contact table because the contact details may contain digital address, phone number and all.
But the table you are considering
(ContactId, AddressTypeId, Address, FormatterId) is not in normal form and you can't uniquely identify a tuple until you read all the four columns which is bad and in this case indexing also not going to help you.
So better if you have if separate table for each type of the digital address, and have indexing on contactID
facebookdetails(contactid, rest of the details)
phonedetails(contactid, rest of the details)
And then the query can be join of all the tables, it will not degrade the performance.
Hope this will help :)

Related

Common identifier for a person across disparate systems

Not sure which is the best Stack Exchange site for this, so will try my hand here.
I have a web application that stores user disciplinary data for organisations. Rather than clients enter their staff into multiple systems, some want to push the basic personnel data into ours (data such as First Name, Surname, DOB, Job Title etc) from their source (e.g. HR/ERP) databases.
Our clients are using a range of existing systems to store their data, such as Oracle, SAP, JD Edwards, etc.
I am familiar with the technical methods to get this data (e.g. web service, web API), but not for a case such as when a person's surname changes (e.g. Janet Smith gets married and becomes Janet Doe). Unless there is a unique identifier for that person across both systems, I can't see how that change can be managed reliably.
How is this process best-managed please? Is an additional field added to the destination database that contains the UID of the source data? Or, do both parties agree on a common field, e.g. employee number, that never changes?

This issue arises in many circumstances. One case is in the Texas school system where students are tracked longitudinally through numerous education subsystems. A social security number, providing a unique identifier in some cases (although not all) was considered too sensitive for use. Thus, a unique identifier has been generated for each student and staff member. This is part of the permanent information associated with each individual, regardless of employment change, location change, or name change.
This link describes the rationale for the unique id.
This link is the documentation on the Texas Student Data System (TSDS) unique identifier. You might find the XML examples at the end of the document of most interest. Much of the information involves submitting requests for an id where demographic information is needed for disambiguation.
Basically, something similar to a Java UUID as an extra field in the database should be sufficient to achieve your aim.
Hope this helps.

Yes, the UID is the only solution. This problem comes up in medical systems too, for example. Another is photos, I'm not sure which causes more problems!

I know approach which is using "external_id" field for that. Several external ids can be exploited in case of many systems.

Separate tables for name and email addresses?

I've been fighting my initial decision since I've began work on my database. I'm debating whether or not I need a separate table for email addresses. My database looks like this:
people(id, first_name, last_name, email)
addresses(id, address, street, city, state, zip, latitude, longitude)
addresses_people(id, person_id, address_id)
phone_numbers(id, person_id, phone_number, type)
I figured I didn't need a separate table for email addresses as I only wanted one per person regardless if they have more. The problem I'm seeming to have though is that some people will NOT have email addresses. Very often I will be storing children in the people table. Now it seems like it would be better design if I put the email addresses in a separate table to avoid the thousands of empty email fields I'll have.
It's a huge hassle to change this now as the app is already in production somewhat, but changing it now as opposed to a year or two from now would be exponentially easier. Is it worth a day or two to change the emails to another table?

In my opinion you are over-engineering. Optional email field is fine. Actually having a separate table might introduce much bigger overhead.
The only reason for a separate table is to model 1-N relationship if you expect the user to have more than one e-mail.

I believe there is no need to change it. Having some fields in a row be empty is not a big deal most of the time. In fact, it is common in most databases. The design of the database should depend on the objects you are trying to model primarily and not be concerned with separating things into tables simply for storage reasons unless you have serious storage constraints or other extenuating circumstances.
See Database design - empty fields which also shed a lot of light on your question.

Why use a 1-to-1 relationship in database design?

I am having a hard time trying to figure out when to use a 1-to-1 relationship in db design or if it is ever necessary.
If you can select only the columns you need in a query is there ever a point to break up a table into 1-to-1 relationships. I guess updating a large table has more impact on performance than a smaller table and I'm sure it depends on how heavily the table is used for certain operations (read/ writes)
So when designing a database schema how do you factor in 1-to-1 relationships? What criteria do you use to determine if you need one, and what are the benefits over not using one?

From the logical standpoint, a 1:1 relationship should always be merged into a single table.
On the other hand, there may be physical considerations for such "vertical partitioning" or "row splitting", especially if you know you'll access some columns more frequently or in different pattern than the others, for example:
You might want to cluster or partition the two "endpoint" tables of a 1:1 relationship differently.
If your DBMS allows it, you might want to put them on different physical disks (e.g. more performance-critical on an SSD and the other on a cheap HDD).
You have measured the effect on caching and you want to make sure the "hot" columns are kept in cache, without "cold" columns "polluting" it.
You need a concurrency behavior (such as locking) that is "narrower" than the whole row. This is highly DBMS-specific.
You need different security on different columns, but your DBMS does not support column-level permissions.
Triggers are typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do. For example, Oracle doesn't let you modify the so called "mutating" table from a row-level trigger - by having separate tables, only one of them may be mutating so you can still modify the other from your trigger (but there are other ways to work-around that).
Databases are very good at manipulating the data, so I wouldn't split the table just for the update performance, unless you have performed the actual benchmarks on representative amounts of data and concluded the performance difference is actually there and significant enough (e.g. to offset the increased need for JOINing).
On the other hand, if you are talking about "1:0 or 1" (and not a true 1:1), this is a different question entirely, deserving a different answer...
See also: When I should use one to one relationship?

Separation of duties and abstraction of database tables.
If I have a user and I design the system for each user to have an address, but then I change the system, all I have to do is add a new record to the Address table instead of adding a brand new table and migrating the data.
EDIT
Currently right now if you wanted to have a person record and each person had exactly one address record, then you could have a 1-to-1 relationship between a Person table and an Address table or you could just have a Person table that also had the columns for the address.
In the future maybe you made the decision to allow a person to have multiple addresses. You would not have to change your database structure in the 1-to-1 relationship scenario, you only have to change how you handle the data coming back to you. However, in the single table structure you would have to create a new table and migrate the address data to the new table in order to create a best practice 1-to-many relationship database structure.

Well, on paper, normalized form looks to be the best. In real world usually it is a trade-off. Most large systems that I know do trade-offs and not trying to be fully normalized.
I'll try to give an example. If you are in a banking application, with 10 millions passbook account, and the usual transactions will be just a query of the latest balance of certain account. You have table A that stores just those information (account number, account balance, and account holder name).
Your account also have another 40 attributes, such as the customer address, tax number, id for mapping to other systems which is in table B.
A and B have one to one mapping.
In order to be able to retrieve the account balance fast, you may want to employ different index strategy (such as hash index) for the small table that has the account balance and account holder name.
The table that contains the other 40 attributes may reside in different table space or storage, employ different type of indexing, for example because you want to sort them by name, account number, branch id, etc. Your system can tolerate slow retrieval of these 40 attributes, while you need fast retrieval of your account balance query by account number.
Having all the 43 attributes in one table seems to be natural, and probably 'naturally slow' and unacceptable for just retrieving single account balance.

It makes sense to use 1-1 relationships to model an entity in the real world. That way, when more entities are added to your "world", they only also have to relate to the data that they pertain to (and no more).
That's the key really, your data (each table) should contain only enough data to describe the real-world thing it represents and no more. There should be no redundant fields as all make sense in terms of that "thing". It means that less data is repeated across the system (with the update issues that would bring!) and that you can retrieve individual data independently (not have to split/ parse strings for example).
To work out how to do this, you should research "Database Normalisation" (or Normalization), "Normal Form" and "first, second and third normal form". This describes how to break down your data. A version with an example is always helpful. Perhaps try this tutorial.

Often people are talking about a 1:0..1 relationship and call it a 1:1. In reality, a typical RDBMS cannot support a literal 1:1 relationship in any case.
As such, I think it's only fair to address sub-classing here, even though it technically necessitates a 1:0..1 relationship, and not the literal concept of a 1:1.
A 1:0..1 is quite useful when you have fields that would be exactly the same among several entities/tables. For example, contact information fields such as address, phone number, email, etc. that might be common for both employees and clients could be broken out into an entity made purely for contact information.
A contact table would hold common information, like address and phone number(s).
So an employee table holds employee specific information such as employee number, hire date and so on. It would also have a foreign key reference to the contact table for the employee's contact info.
A client table would hold client information, such as an email address, their employer name, and perhaps some demographic data such as gender and/or marital status. The client would also have a foreign key reference to the contact table for their contact info.
In doing this, every employee would have a contact, but not every contact would have an employee. The same concept would apply to clients.

Just a few samples from past projects:
a TestRequests table can have only one matching Report. But depending on the nature of the Request, the fields in the Report may be totally different.
in a banking project, an Entities table hold various kind of entities: Funds, RealEstateProperties, Companies. Most of those Entities have similar properties, but Funds require about 120 extra fields, while they represent only 5% of the records.

Should User and Address be in separate tables?

Currently my users table has the below fields
Username
Password
Name
Surname
City
Address
Country
Region
TelNo
MobNo
Email
MembershipExpiry
NoOfMembers
DOB
Gender
Blocked
UserAttempts
BlockTime
Disabled
I'm not sure if I should put the address fields in another table. I have heard that I will be breaking 3NF if I don't although I can't understand why. Can someone please explain?

There are several points that are definitely not 3NF; and some questionable ones in addition:
Could there could be multiple addresses per user?
Is an address optional or mandatory?
Does the information in City, Country, Region duplicate that in Address?
Could a user have multiple TelNos?
Is a TelNo optional or mandatory?
Could a user have multiple MobNos?
Is a MobNo optional or mandatory?
Could a user have multiple Emails?
Is an Email optional or mandatory?
Is NoOfMembers calculated from the count of users?
Can there be more than one UserAttempts?
Can there be more than one BlockTime per user?
If the answer to any of these questions is yes, then it indicates a problem with 3NF in that area. The reason for 3NF is to remove duplication of data; to ensure that updates, insertions and deletions leave the data in consistent form; and to minimise the storage of data - in particular there is no need to store data as "not yet known/unknown/null".
In addition to the questions asked here, there is also the question of what constitutes the primary key for your table - I would guess it is something to do with user, but name and the other information you give is unlikely to be unique, so will not suffice as a PK. (If you think name plus surname is unique are you suggesting that you will never have more than one John Smith?)
EDIT:
In the light of further information that some fields are optional, I would suggest that you separate out the optional fields into different tables, and establish 1-1 links between the new tables and the user table. This link would be established by creating a foreign key in the new table referring to the primary key of the user table. As you say none of the fields can have multiple values then they are unlikely to give you problems at present. If however any of these change, then not splitting them out will give you problems in upgrading the application and the data to support the application. You still need to address the primary key issue.

As long as every user has one address and every address belongs to one user, they should go in the same table (a 1-to-1 relationship). However, if users aren't required to enter addresses (an optional relationship) a separate table would be appropriate. Also, in the odd case that many users share the same address (e.g. they're convicts in the same prison), you have a 1-to-many relationship, in which case a separate table would be the way to go. EDIT: And yes, as someone pointed out in the comments, if users have multiple address (a 1-to-many the other way around), there should also be separate tables.

Just as point that I think might help someone in this question, I once had a situation where I put addresses right in the user/site/company/etc tables because I thought, why would I ever need more than one address for them? Then after we completed everything it was brought to my attention by a different department that we needed the possibility of recording both a shipping address and a billing address.
The moral of the story is, this is a frequent requirement, so if you think you ever might want to record shipping and billing addresses, or can think of any other type of address you might want to record for a user, go ahead and put it in a separate table.
In today's age, I think phone numbers are a no brainer as well to be stored in a separate table. Everyone has mobile numbers, home numbers, work numbers, fax numbers, etc., and even if you only plan on asking for one, people will still put two in the field and separate them by a semi-colon (trust me). Just something else to consider in your database design.

the point is that if you imagine to have two addresses for the same user in the future, you should split now and have an address table with a FK pointing back to the users table.
P.S. Your table is missing an identity to be used as PK, something like Id or UserId or DataId, call it the way you want...

By adding them to separate table, you will have a easier time expanding your application if you decide to later. I generally have a simple user table with user_id or id, user_name, first_name, last_name, password, created_at & updated_at. I then have a profile table with the other info.
Its really all preference though.

You should never group two different types of data in a single table, period. The reason is if your application is intended to be used in production, sooner or later different use-cases will come which will need you to higher normalised table structure.
My recommendation - Adhere to SOLID principles even in DB design.

Table "Inheritance" in SQL Server

I am currently in the process of looking at a restructure our contact management database and I wanted to hear peoples opinions on solving the problem of a number of contact types having shared attributes.
Basically we have 6 contact types which include Person, Company and Position # Company.
In the current structure all of these have an address however in the address table you must store their type in order to join to the contact.
This consistent requirement to join on contact type gets frustrating after a while.
Today I stumbled across a post discussing "Table Inheritance" (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server).
Basically you have a parent table and a number of sub tables (in this case each contact type). From there you enforce integrity so that a sub table must have a master equivalent where it's type is defined.
The way I see it, by this method I would no longer need to store the type in tables like address, as the id is unique across all types.
I just wanted to know if anybody had any feelings on this method, whether it is a good way to go, or perhaps alternatives?
I'm using SQL Server 05 & 08 should that make any difference.
Thanks
Ed

I designed a database just like the link you provided suggests. The case was to store the data for many different technical reports. The number of report types is undefined and will probably grow to about 40 different types.
I created one master report table, that has an autoincrement primary key. That table contains all common information like customer, testsite, equipmentid, date etc.
Then I have one table for each report type that contains the spesific information relating to that report type. That table have the same primary key as the master and references the master as well.
My idea for splitting this into different tables with a 1:1 relation (which normally would be a no-no) was to avoid getting one single table with a huge number of columns, that gets very difficult to maintain as your constantly adding columns.
My design with table inheritance gave me segmented data and expandability without beeing difficult to maintain. The only thing I had to do was to write special a special save method to handle writing to two tables automatically. So far I'm very happy with the design and haven't really found any drawbacks, except for a little more complicated save method.

Google on "gen-spec relational modeling". You'll find a lot of articles discussing exactly this pattern. Some of them focus on table design, while others focus on an object oriented approach.
Table inheritance pops up in a few of them.

I know this won't help much now, but initially it may have been better to have an Entity table rather than 6 different contact types. Then each Entity could have as many addresses as necessary and there would be no need for type in the join.

You'll still have the problem that if you want the sub-type fields and you have only the master contact, you'll have to know what table to go looking at - or else join to all of them. But otherwise this is a workable solution to a common problem.
Another possibility (fairly similar in structure, but different in how you think of it) is to simply put all your contacts into one table. Then for the more specific fields (birthday say for people and department for position#company) create separate tables that are associated with that contact.
Contact Table
--------------
Name
Phone Number
Address Table
-------------
Street / state, etc
ContactId
ContactBirthday Table
--------------
Birthday
ContactId
Departments Table
-----------------
Department
ContactId
It requires a different way of thinking of things though - instead of thinking of people vs. companies, you think of the various functional requirements for the task at hand - if you want to send out birthday cards, get all the contacts that have birthdays associated with them, etc..

I'm going to go out on a limb here and suggest you should rethink your normalization strategy (as you seem to be lucky enough to be able to rethink your schema quite fundamentally). If you typically store an address for each contact, then your contact table should have the address fields in it. Alternatively if the address is stored per company then the address should be stored in the company table and your contacts linked to that company.
If your contacts only have one address, or one (or even 3, just not 'many') instance of the other fields, think about rationalizing them into a single table. In my experience having a few null fields is a far better alternative than needing left joins to data you aren't sure exists.
Fortunately for anyone who vehemently disagrees with me you did ask for opinions! :) IMHO you should only normalize when you really need to. Where you are rethinking schemas, denormalization should be considered at every opportunity.

When you have a 7th type, you'll have to create another table.

I'm going to try this approach. Yes, you have to create new tables when you have a new type, but since this table will probably have different columns, you'll end up doing this anyway if you don't use this scheme.
If the tables that inherit the master don't differentiate much from one another, I'd recommend you try another approach.

May I suggest that we just add a Type table. Ie a person has an address, name etc then the student, teacher as each use case presents its self we have a PersonType table that has an entry from the person table to n types and the subsequent new tables teacher, alien, singer as the system eveolves...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight