Relational database design considering partly dependent information?

I have the strong feeling that I can't see the wood for the trees here, so I need your help.
Think of the following two tables:
create table Category (
    Category_Id integer primary key,
    Category_Desc nvarchar2(500)
);

create table Text (
    Text_Id integer primary key,
    Text nvarchar2(1000),
    Category_Id integer references Category (Category_Id)
);
This code is just a sketch to give an idea of the problem.
The idea is to store text snippets for certain categories to use in an interface, such as messages ("You can't do that!", "Do this!", ...), but also to create notes for other objects, e.g. orders ("Important customer! Prioritize this order!").
Now for my question. Some of these text snippets carry additional information with them; for example, when you add the "Important customer" note to an order, Order.Prio_Flag should also be set.
This is very specific information that only applies to text in the Order_Note category. I don't want to add it to the Text table, since most entries are not affected by it and the table would become increasingly crowded with special cases that apply to only a small fraction of its content.
I get the feeling the design is flawed, but I also don't want a table for every category; I'd like to keep this as general as possible.
Keep in mind that this is a simplified view of the problem.
TL;DR: How do I attach extra information to some of a table's rows without adding new attributes, given that the new attribute would be filled for only a small fraction of the entries?

Subtyping and dependent attributes are easy to do in a relational database. For example, if some Texts are important and need to have a dependent attribute (e.g. DisplayColor), you could add the following table to your schema:
CREATE TABLE ImportantText (
    Text_Id       integer NOT NULL,
    Display_Color integer NOT NULL,
    PRIMARY KEY (Text_Id),
    CONSTRAINT ImportantTextSubtypeOfText
        FOREIGN KEY (Text_Id) REFERENCES Text (Text_Id)
        ON DELETE CASCADE ON UPDATE CASCADE
);
Many people think foreign key constraints establish relationships between entities. That's not what they're for. They're CONSTRAINTS, i.e. they limit the values in a column to be a subset of the values in another column. In this way a subtyping relation is established, which can record additional properties.
In the table above, any element of ImportantText must be an element of Text, and will have all the attributes of Text (since it must be recorded in the Text table), as well as the additional attributes of ImportantText.
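Applied to the original question, the same pattern would give the Order_Note texts their own subtype table. The table and column names below are illustrative, not taken from the question's schema:
-- Hypothetical subtype table: only texts in the Order_Note category
-- that carry extra behaviour get a row here.
CREATE TABLE OrderNoteText (
    Text_Id        integer NOT NULL,
    Sets_Prio_Flag integer NOT NULL,  -- 1 = attaching this note should set Order.Prio_Flag
    PRIMARY KEY (Text_Id),
    CONSTRAINT OrderNoteTextSubtypeOfText
        FOREIGN KEY (Text_Id) REFERENCES Text (Text_Id)
        ON DELETE CASCADE ON UPDATE CASCADE
);
The Text table stays lean; the rare special-case attributes live only in the subtype table.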

Related

Database theory: best way to have a "flags" table which could apply to many entities?

I'm building a data model for a new project I'm developing. Part of this data model involves entities having "flags".
A flag is simply a boolean value - either an entity has the flag or it does not have the flag. To that end I have a table simply called "flags" that has an ID, a string name, and a description. (An example of a flag might be "is active" or "should be displayed" or "belongs to group".)
So for example, any user in my users table could have none, one, or many flags. So I create a userFlags bridge table with user ID and flag ID. If the table contains a row for the given flag ID and user ID, that user has that flag.
Ok, so now I add another entity - say "section". Each section can also have flags. So I create a sectionFlags table to accommodate this.
Now I have another entity - "content", so again, "contentFlags".
And so on.
My final data model has basically two tables per entity, one to hold the entity and one for flags.
While this certainly works, it seems like there may be a better way to design my model, so I don't have to have so many bridge tables. One idea I had was a master "hasFlags" table with flag ID, item ID and item type. The item type could be an enumerated field only accepting values corresponding to known entities. The only problem there is that my foreign key for the entity will not work because each "item ID" could refer to a different entity. (I have actually used this technique in other data models, and while it certainly works, you lose referential integrity as well as things like cascade updates.)
Or, perhaps my data model is fine as-is and that's just the nature of the beast.
Any more-advanced experienced DB devs care to chime in?
The many-to-many relationships are one way to do it (and possibly faster than what I'm about to suggest because they can use integer key indexes).
The other way to do this is with a polymorphic relationship.
Your entity-to-flag table needs two columns in addition to the foreign key link to the flag table:
other_key integer not null
other_type varchar(...) not null
And in those fields you store the foreign key of the relation in the integer and the type of the relation in the varchar. Full-on ORMs that support this sometimes store the class name of the foreign relation in the type column, to aid with object loading.
The downside here is that the integer can't be a true foreign key as it will contain duplicates from many tables. It also makes your querying a bit more interesting per-join than the many-to-many tables, but it does allow you to generalise your join in code.
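As a rough sketch of what that polymorphic table might look like (the table name entityFlags and the varchar length are assumptions; the flags table is the one from the question):
-- other_key cannot be a declared foreign key, because it points at a
-- different table depending on other_type.
CREATE TABLE entityFlags (
    flag_id    integer     NOT NULL REFERENCES flags (id),
    other_key  integer     NOT NULL,  -- PK value of the related row
    other_type varchar(30) NOT NULL,  -- e.g. 'user', 'section', 'content'
    PRIMARY KEY (flag_id, other_key, other_type)
);

-- Joining back to a concrete entity means filtering on the type:
SELECT u.*
FROM users u
JOIN entityFlags ef ON ef.other_key = u.id AND ef.other_type = 'user';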

DB Normalization and single field break outs

My question has probably been asked many times, but I can't quite find it (nor has googling been very good).
I'm trying to normalize our DB. Here is the example:
Say we currently have a single table:
Property
---------
id
name
type
types can either be:
multi-family
single-family
healthcare
commercial
I could break this into a separate table so that we have:
Property        Prop_Type
--------        ---------
id              prop_id
name            type
type_id
According to 2-n, I should break this up. But what am I actually saving in performance? I agree that breaking up tables like this makes it easier for us to insert new types of real estate, or modify current ones. But assuming that this isn't very necessary, would this result in a performance increase? The field Property.type is holding up to a 32 byte string versus a Property.type_id which is similar (no?). Plus there is an additional table required in the second option, and a join every time we want to access that data. Finally, our DB is not that large (maybe tens of thousands of records), so space saving is not a priority.
Should I continue to normalize or should I hold off on these small individual breaks?
Thank you!
Should I continue to normalize or should I hold off on these small individual breaks?
Normalization to higher normal forms replaces a table by others using the same columns that join back to the original based on functional dependencies and join dependencies.
According to 2-n, I should break this up
Presumably you mean 2NF. You have not given any information to justify that. And what you discuss doing has nothing to do with normalization.
It looks like you understand little about normalization. Get a reference that presents and explains its issues, definitions, and procedures. Use them. Quote them.
But what am I actually saving in performance?
Normalization should be done regardless of performance. You change when justified by the demonstrated present value of changing to another particular design based on the ideal/original.
It's not meaningful to talk about a design's performance without having been given details for a particular DBMS implementation plus expected use. But roughly speaking introducing ids uses less space but causes more joins.
DBMSes exist to have information stored in tables and queried via algebra and/or conditions as implemented by the DBMS. Just make the most straightforward design. You need to understand far more about schemas and querying before you will know enough to modify a design for performance.
I agree that breaking up tables like this makes it easier for us to insert new types of real estate,
No, it makes it harder. All you used to have to do is enter the type value you wanted in a Property row. With ids you have to add a Prop_Type row and use that type_id in a Property row.
If possible values for Property type are fixed then add a CHECK constraint on Property type:
CHECK(type IN ('multi-family','single-family','healthcare','commercial'))
(Otherwise, don't.)
If you want the possible values for properties to be updatable and queryable without a schema change, and there does not have to be a property for every type, then that is something your original design cannot express. But you still don't need to introduce ids; you can have a Prop_Type table with just a type column and a foreign key from Property.type to Prop_Type.type.
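A minimal sketch of that id-free variant, using the 32-character type strings from the question (other column lengths are guesses):
CREATE TABLE Prop_Type (
    type varchar(32) PRIMARY KEY
);

CREATE TABLE Property (
    id   integer PRIMARY KEY,
    name varchar(100),
    type varchar(32) REFERENCES Prop_Type (type)
);

-- New property types are now plain data changes, not schema changes:
INSERT INTO Prop_Type (type) VALUES ('mixed-use');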
I think that this is not a normalization problem.
The type column is essentially a discrete type, i.e. has a finite set of values - currently multi-family, single-family, healthcare, commercial.
What you want is to control that no invalid value is inserted into the column. Your prop_type table and a foreign key constraint is one solution.
A more suitable solution is to use a CHECK CONSTRAINT on the column:
CREATE TABLE Property
(
id int PRIMARY KEY,
name ...,
type varchar(20) CONSTRAINT typeValues CHECK (type IN ('multi-family', 'single-family', 'healthcare', 'commercial'))
)
Going further, there is no need to store the complete type string in every record. You could simply use a single character to encode the type:
CREATE TABLE Property
(
...
type char(1) CONSTRAINT typeValues CHECK (type IN ('M', 'S', 'H', 'C'))
)
When you present the type, e.g. in a GUI, you would need to translate them into user readable text. To enter a value you would use a dropdown in the GUI.
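For example, one way to translate the codes back into readable text in a query, mirroring the CHECK constraint above (a sketch, not the only option):
SELECT id,
       name,
       CASE type
           WHEN 'M' THEN 'multi-family'
           WHEN 'S' THEN 'single-family'
           WHEN 'H' THEN 'healthcare'
           WHEN 'C' THEN 'commercial'
       END AS type_desc
FROM Property;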

Database Design: Categories in their own table?

I am redesigning a few databases into one encompassing database, and I have noticed that the previous designer(s) of the old databases liked to store categories in their own tables. For example, say there is a table boats(bid: integer, bname: string, color: integer), and in the application there is a drop-down box allowing the user to specify the colour of the boat; then there is a table color(cid: integer, cname: string). I would not have included the color table, and would just have put the colours as strings in the boats table. I realize that this decreases redundant storage of colour names, but is the added run-time cost of joining the boat table with the colour table "worth it"? Also, the drop-downs are populated with SELECT cname FROM color statements, while I would have defined a view on SELECT DISTINCT color FROM boats to populate the drop-downs.
The example is simple, but this happens multiple times in the system I am redesigning, even for categories with only two options. This has resulted in many tables with only 2 fields. Some only have 1 field (I haven't figured out what those are for yet, but I think they are only to populate the drop-downs, and the actual tables contain the values as well).
I would personally keep them in their own table if this were my DB.
If you get into a situation where the requirement becomes that boats A, B, and C can only come in silver and black, then you will be thankful that you did. I've seen these types of requests bubble up down the road in a lot of projects.
If you are just concerned about the query complexity you could create a view that joins the information you need so you only need to query it once and with no JOIN.
If you are worried about the performance implications of the JOIN then I would look at creating the appropriate indexes or possibly an indexed view.
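For instance, a minimal sketch of such a view over the question's boats(bid, bname, color) and color(cid, cname) tables (the view name is made up):
CREATE VIEW BoatsWithColor AS
SELECT b.bid,
       b.bname,
       c.cname AS color_name
FROM boats b
JOIN color c ON c.cid = b.color;

-- Callers now get readable colours without writing the JOIN themselves:
SELECT bid, bname, color_name FROM BoatsWithColor;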
Good luck!
When you know a column should have a limited set of values, you should tell the dbms to enforce that limited set. The three most common ways to deal with that kind of requirement are
ignore it,
set a foreign key reference to a table of colors, and
use a CHECK() constraint against a list of colors.
Of those three, setting a foreign key to a table of colors tends to make life easiest.
I realize that this decreases redundant storage of colour names, but
is the added run-time cost of joining the boat table with the colour
table "worth it"?
This is a different issue. First, storing foreign key values is a form of data integrity, not a form of redundancy. Keys exist for two reasons: 1) to identify things in the real world, and 2) to be stored in other tables. (Because the thing the key identifies is relevant to the other table.)
Second, if you identify colors by assigning an arbitrary id number to them, you have to use a JOIN to get human-readable information. But colors, like many attributes, carry their identity with them. If you use the color's name itself ("red", "orange", etc.) or use a human-readable code for the name ("R", "O", etc.) you don't need a join. You do still need a table of colors (or a CHECK() constraint), because the column in boats has a limited set of values, and the dbms should enforce use of that limited set of values.
So you could do something like this.
-- the referenced table must exist before boats can reference it
create table hull_colors (
    color varchar(10) primary key
);

create table boats (
    boat_id integer primary key,
    registered_name varchar(35) not null,
    hull_color varchar(10) not null references hull_colors (color)
);

insert into hull_colors values ('red'), ('orange'), ('yellow'); -- etc.
Both those tables are in 5NF.
It is generally better to have a normalized database.
However, in your example, you can use a Categories(ID, Type, Name) table and store the colors as (3, 'Color', 'Blue'), (4, 'Color', 'Red'), ... This way you can store more categories in the same table and, at the same time, store them separately. Populating a drop-down list will then require a simple select of the form select ID, Name from Categories where Type = 'Color'.
EDIT: Note that this solution violates the first rule of database normalization, as @Catcall said. A 3NF table would be Colors(ID, Name). This way, you can refer to a certain color using its ID.
Using select distinct color from boats to populate a drop-down list has many disadvantages. For example, what if the Boats table contains no records? Then your select returns nothing and the drop-down control will not be populated with any values. Another problem arises when you have fields containing 'Red' and 'red' or similar. See more details on database normalization here.
It sounds like those are lookup tables, so that if the end user wants to add an additional color they can add it to the database and it will then propagate to the UI. This also gets into normalization. If there is only one place where colors are referenced, then the lookup table is not really necessary. However, if there are multiple tables where colors are referenced for different things, then the lookup table will save you a huge headache down the road.

Organizing database tables - large number of properties

I have a database that stores some users in it. Each user has account settings, privacy settings, and lots of other properties to set. The number of those properties has started to grow, and I could end up with 30 or so.
Until now I kept them in a "UserInfo" table, with User and UserInfo related one-to-many (keeping a log of all changes). Putting everything in a single "UserInfo" table doesn't sound nice and, at least in the database model, it would look messy. What's the solution?
Separating privacy settings, account settings, and other "groups" of settings into their own tables, with 1-1 relations between UserInfo and each settings table, is one solution, but would that be too slow (or much slower) when retrieving the data? I guess not all the data would be presented on a single page at the same moment. So maybe having one-to-many relationships to each table is a solution too (keeping a log of each group separately)?
If it's only 30 properties, I'd recommend just creating 30 columns. That's not too much for a modern database to handle.
But I would guess that if you have 30 properties today, you will continue to invent new properties as time goes on, and the number of columns will keep growing. Restructuring your table to add columns every day may become time-consuming once you have lots of rows.
For an alternative solution check out this blog for a nifty solution for storing lots of dynamic attributes in a "schemaless" way: How FriendFeed Uses MySQL.
Basically, collect all the properties into some format and store it in a single TEXT column. The format is semi-structured, that is your application can separate the properties if needed but you can also add more at any time, or even have different properties per row. XML or YAML or JSON are example formats, or some object serialization format supported by your application code language.
CREATE TABLE Users (
    user_id SERIAL PRIMARY KEY,
    user_properties TEXT
);
This makes it hard to search for a given value in a given property. So, in addition to the TEXT column, create an auxiliary table for each property you want to be searchable, with two columns: the value of the given property, and a foreign key back to the main table where that particular value is found. Now you can index the value column so lookups are quick.
CREATE TABLE UserBirthdate (
user_id BIGINT UNSIGNED PRIMARY KEY,
birthdate DATE NOT NULL,
FOREIGN KEY (user_id) REFERENCES Users(user_id),
KEY (birthdate)
);
SELECT u.* FROM Users AS u INNER JOIN UserBirthdate b USING (user_id)
WHERE b.birthdate = '2001-01-01';
This means as you insert or update a row in Users, you also need to insert or update into each of your auxiliary tables, to keep it in sync with your data. This could grow into a complex chore as you add more auxiliary tables.
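A sketch of the paired writes this implies, in the MySQL-flavoured syntax of the examples above (the JSON payload and the LAST_INSERT_ID() usage are illustrative assumptions):
INSERT INTO Users (user_properties)
VALUES ('{"first_name": "Ada", "birthdate": "2001-01-01"}');

-- The application must keep the searchable auxiliary table in sync:
INSERT INTO UserBirthdate (user_id, birthdate)
VALUES (LAST_INSERT_ID(), '2001-01-01');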

What is the best way to keep this schema clear?

Currently I'm working on an RFID project where each tag is attached to an object. An object could be a person, a computer, a pencil, a box, or whatever comes to my boss's mind.
And of course each object has different attributes.
So I'm trying to have a table tags where I can keep a register of each tag in the system (registration of the tag), and other tables where I can relate a tag with an object and describe some other attributes. This is what I have done (no real schema, just a simplified version).
Suddenly, I realized that this schema could have the same tag in several tables.
For example, the tag 123 could be in C and B at the same time, which is impossible because each tag can only be attached to a single object.
To put it simply, I want each tag to appear no more than once in the database.
My current approach
What I really want
Update:
Yeah, the TagID is chosen by the end user. Moreover the TagID is given by a Tag Reader and the TagID is a 128-bit number.
New Update:
The objects until now are:
-- Medicament(TagID, comercial_name, generic_name, amount, ...)
-- Machine(TagID, name, description, model, manufacturer, ...)
-- Patient(TagID, firstName, lastName, birthday, ...)
All the attributes (columns, or whatever you call them) are very different.
Update after update
I'm working on a system with RFID tags for a hospital. Each RFID tag is attached to an object in order to keep track of them, and unfortunately each object has a lot of different attributes.
An object could be a person, a machine or a medicine, or maybe a new object with other attributes.
So, I just want a flexible and clever schema that allows me to introduce new object types and also lets me easily add new attributes to an object, keeping in mind that this system could become very large.
Examples:
Tag(TagID)
Medicine(generic_name, comercial_name, expiration_date, dose, price, laboratory, ...)
Machine(model, name, description, price, buy_date, ...)
Patient(PatientID, first_name, last_name, birthday, ...)
We must relate exactly one tag to exactly one object.
Note: I don't really speak (or write) English well. :P Sorry for that; not a native speaker here.
You can enforce these rules using relational constraints. Check out the use of a persisted computed column to enforce the constraint Tag:{Pencil or Computer}. This model gives you great flexibility to model each child table (Person, Machine, Pencil, etc.) and at the same time prevents any conflict between tags. It's also good that we don't have to resort to triggers or UDFs via check constraints to enforce the relation; the relation is built into the model.
create table dbo.TagType (TagTypeID int primary key, TagTypeName varchar(10));
insert into dbo.TagType
values(1, 'Computer'), (2, 'Pencil');
create table dbo.Tag
( TagId int primary key,
TagTypeId int references TagType(TagTypeId),
TagName varchar(10),
TagDate datetime,
constraint UX_Tag unique (TagId, TagTypeId)
)
go
create table dbo.Computer
( TagId int primary key,
TagTypeID as 1 persisted,
CPUType varchar(25),
CPUSpeed varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeID)
)
go
create table dbo.Pencil
( TagId int primary key,
TagTypeId as 2 persisted,
isSharp bit,
Color varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeId)
)
go
-----------------------------------------------------------
-- create a new tag of type Pencil:
-----------------------------------------------------------
insert into dbo.Tag(TagId, TagTypeId, TagName, TagDate)
values(1, 2, 'Tag1', getdate());
insert into dbo.Pencil(TagId, isSharp, Color)
values(1, 1, 'Yellow');
-----------------------------------------------------------
-- try to make it a Computer too (fails FK)
-----------------------------------------------------------
insert into dbo.Computer(TagId, CPUType, CPUSpeed)
values(1, 'Intel', '2.66ghz')
Have a Tag table whose PK, TagID, is an identity column.
This will ensure that each TagID only shows up once no matter what.
Then in the Tag table have a TagType column that can either be free-form (the table name) or, better yet, have a TagType table with entries A, B, C and an FK in Tag pointing to TagType.
I would move the tag attributes into tables A, B, and C to minimize extra data in Tag, or have a series of junction tables between Tag and A, B, and C.
EDIT:
Assuming the TagID is created when the object is created, this will work fine (insert into Tag first to get the TagID and capture it with SCOPE_IDENTITY()).
This assumes users cannot edit the TagID itself.
If users can choose the TagID, then still use a Tag table with the identity TagID, but add another field called DisplayID where the user can type in a number. Just put a unique constraint on Tag.DisplayID, as sketched below.
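A short sketch of that variant (SQL Server syntax, matching the other answers here; the DisplayID type is an assumption):
CREATE TABLE Tag (
    TagID     int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, never user-edited
    DisplayID varchar(32) NOT NULL,           -- the id the user types in
    CONSTRAINT UQ_Tag_DisplayID UNIQUE (DisplayID)
);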
EDIT:
What attributes are you needing and are they nullable? If they are different for A, B, and C then it is cleaner to put them in A, B, and C especially if there might be some for A and B but not C...
I talked with Raz to clear up what he's trying to do. What he wants is a flexible way to store attributes related to tags. A tag can be attached to one of multiple types of objects, and each object type has a specific list of attributes. He also wants to be able to add objects/attributes without having to change the schema. Here's the model I came up with:
If each tag can only be in A, B, or C once, I'd just combine A, B, and C into one table. It'd be easier to give you a better idea of how to build your schema if you gave an example of exactly what you're trying to collect.
To me, from what I've read, it sounds like you have a list of tags and a list of objects, and you need to assign a tag to an object. If that is the case, I'd have a Tags table, an Objects table, and an ObjectTag table. In the ObjectTag table you would have a foreign key to the Tags table and a foreign key to the Objects table. Then you put a unique index on the tag foreign key, and now you've enforced your requirement of only using a tag once.
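A sketch of that layout; all names here are assumptions:
CREATE TABLE Tags (
    TagId int PRIMARY KEY
);

CREATE TABLE Objects (
    ObjectId   int PRIMARY KEY,
    ObjectType varchar(20) NOT NULL
);

CREATE TABLE ObjectTag (
    TagId    int NOT NULL REFERENCES Tags (TagId),
    ObjectId int NOT NULL REFERENCES Objects (ObjectId),
    PRIMARY KEY (ObjectId, TagId),
    CONSTRAINT UQ_ObjectTag_TagId UNIQUE (TagId)  -- each tag assigned at most once
);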
I would tackle this using your original structures. Relational databases are a lot better at aggregating/combining atomic data than they are at parsing complex data structures.
Keep the design of each "tag-able" object type in its own table. Data types, check constraints, default values, etc. are still easily implemented this way. Also, continue to define a FK from each object table to the Tags table.
I'm assuming you already have this in place, but if you place a unique constraint on the TagId column in each of the object tables (A, B, C, etc.) then you can guarantee uniqueness within that object type.
There are no built-in SQL Server constraints to guarantee uniqueness among all the object types, if implemented as separate tables. So, you will have to make your own validation. An INSTEAD OF trigger on your object tables can do this cleanly.
First, create a view to access the TagId list across all your object tables.
CREATE VIEW TagsInUse AS
SELECT A.TagId FROM A
UNION
SELECT B.TagId FROM B
UNION
SELECT C.TagId FROM C
;
Then, for each of your object tables, define an INSTEAD OF trigger to test your TagId.
CREATE TRIGGER dbo.T_IO_Insert_TableA ON dbo.A
INSTEAD OF INSERT
AS
IF EXISTS (SELECT 0 FROM dbo.TagsInUse t INNER JOIN inserted i ON t.TagId = i.TagId)
BEGIN
    -- The tag(s) is/are already in use. Create the necessary notification(s).
    RAISERROR ('You attempted to re-use a TagId. This is not allowed.', 16, 1);
    ROLLBACK;
END
ELSE
BEGIN
    -- The tag(s) is/are available, so proceed with the INSERT.
    INSERT INTO dbo.A (TagId, Attribute1, Attribute2, Attribute3)
    SELECT i.TagId, i.Attribute1, i.Attribute2, i.Attribute3
    FROM inserted AS i;
END
GO
Keep in mind that you can also (and probably should) encapsulate that IF EXISTS test in a T-SQL function for maintenance and performance reasons.
You can write supplementary stored procedures for doing things like finding what object type a TagId is associated with.
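One hedged way to write that lookup, reusing the A, B, C tables from the view above (@TagId is a placeholder parameter):
DECLARE @TagId int = 1;

SELECT 'A' AS object_type FROM dbo.A WHERE TagId = @TagId
UNION ALL
SELECT 'B' FROM dbo.B WHERE TagId = @TagId
UNION ALL
SELECT 'C' FROM dbo.C WHERE TagId = @TagId;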
Pros
You are still taking advantage of SQL Server's data integrity features, which are all quite fast and self-documenting. Don't underestimate the usefulness of data types.
The view is an encapsulation of the domain that must be unique, without combining the underlying sets of attributes. Now you won't have to write any messy code to decipher the object's type; you can base that determination on which table contains the matching tag.
Your options remain open...
Because you didn't store everything in an EAV-friendly nvarchar(300) column, you can tweak the data types for whatever makes the most sense for each attribute.
If you run into any performance issues, you can index the view.
You (or your DBA) can move the object tables to different file groups on different disks if you need to balance things out and help with parallel disk I/O. Think of it as a form of horizontal partitioning. For example, if you have 8 times as many RFID tags applied to medicine containers as you have for patients, you can place the medicine table on a different disk without having to create the partitioning function that you would need for a monolithic table (one table for all types).
If you need to eventually partition your tables vertically (for archiving data onto a read-only partition), you can more easily create a partitioning function for each object type. This would be useful where the business rules differ from one object type to another.
Most importantly, implementing different business rules based on object type is much simpler. You don't have to implement any nasty conditional logic like "IF type = 'needle' THEN ... ELSE IF type = 'patient' THEN ... ELSE IF....". If you need to apply different rules, then apply them to the relevant object table without having to test a "type" value.
Cons
Triggers have to be maintained. However, this checking would have to be done in your application anyway, so you are performing the same data integrity checks at the database instead. That means there is no extra network overhead, and the checks are available to any application that uses this database.
What you're describing is a classical "table-per-type" ORM mapping. Entity Framework has built-in support of this, which you should look into.
Otherwise, I don't think most databases have easy integrity constraints that are enforced over primary keys of multiple tables.
However, is there any reason why you can't just use a single tags table to hold all the fields? Use a type field to hold the type of object. NULL all the irrelevant fields -- this way they don't consume disk space. You'll end up with far fewer tables (only one) that you can maintain as one single coherent object; it also makes you write far fewer SQL queries to work on tags that may span multiple object types.
Implementing it as one single table also saves you disk space because you can implement tiers of inheritance -- for example, "patient" and "doctor" and "nurse" can be three different object types, each having similar fields (e.g. firstname, lastname etc.) and some unique fields. Right now you'll need three tables, with duplicated fields.
It is also simpler when you add an object type. Before, you need to add a new table, and duplicate some SQL statements that span multiple object types. Now you only need to add new fields to the same table (maybe reuse some). The SQL you need to change are far fewer.
The only reason not to go with one single table is when the number of fields makes a row too large to fit inside a SQL Server page (which I believe is 8K). Then SQL will complain and won't allow you to add any more fields. The solution, in this case, is to adopt an ORM tool (like Entity Framework) and then "reuse" fields. For example, if "Field1" is only used by object type #1, there is no reason why object type #3 can't use it to store something as well. You only need to be able to distinguish it in your programs.
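A minimal single-table sketch of this proposal; the column names are illustrative, and binary(16) is just one way to hold the question's 128-bit tag ids:
CREATE TABLE Tags (
    TagID       binary(16)  PRIMARY KEY,  -- 128-bit RFID tag id
    ObjectType  varchar(20) NOT NULL,     -- 'patient', 'machine', 'medicine', ...
    -- patient fields (NULL for other types):
    FirstName   varchar(50) NULL,
    LastName    varchar(50) NULL,
    Birthday    date        NULL,
    -- machine fields:
    Model       varchar(50) NULL,
    -- medicine fields:
    GenericName varchar(50) NULL
);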
You could have the Tags table hold a pointer to any of those tables, plus a Type column that tells you which table it is:

Tags
----
ID
Type (A, B, or C)
A (nullable)
B (nullable)
C (nullable)

A
----
ID
(other attributes)
