What's the best method of storing a large number of booleans in a database table?
Should I create a column for each boolean value or is there a more optimal method?
Employee Table
IsHardWorking
IsEfficient
IsCrazy
IsOverworked
IsUnderpaid
...etc.
I don't see a problem with having a column for each boolean. But if you foresee any future expansion, and want to use the table only for booleans, then use a 2-column table with VARIABLE and VALUE columns, with a row for each bool.
If the majority of employees will have the same values across a large sample size, it can be more efficient to define a hierarchy, allowing you to establish default values representing the norm, and override them per employee if required.
Your employee table no longer stores these attributes. Instead I would create a definition table of attributes:
| ATTRIBUTE_ID | DESCRIPTION | DEFAULT |
| 1 | Is Hard Working | 1 |
| 2 | Is Overpaid | 0 |
Then a second table joining attributes to Employees:
| EMPLOYEE_ID | ATTRIBUTE_ID | OVERRIDE |
| 2 | 2 | 1 |
Take two employees. The employee with ID 1 has no override entry and thus inherits the default attribute values (is hard working, is not overpaid). Employee 2, however, has an override for attribute 2 (Is Overpaid) and is thus both hard working and overpaid.
For integrity you could place a unique constraint on the EMPLOYEE_ID and ATTRIBUTE_ID columns in the override table, so that an attribute can be overridden only once per employee.
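The default-plus-override lookup described above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table names (attribute, employee, employee_attribute) and the column names default_value and override_value are assumptions, renamed from the tables above to avoid reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute (
    attribute_id  INTEGER PRIMARY KEY,
    description   TEXT NOT NULL,
    default_value INTEGER NOT NULL      -- "DEFAULT" is a reserved word, hence the rename
);
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY);
CREATE TABLE employee_attribute (
    employee_id    INTEGER NOT NULL REFERENCES employee (employee_id),
    attribute_id   INTEGER NOT NULL REFERENCES attribute (attribute_id),
    override_value INTEGER NOT NULL,
    UNIQUE (employee_id, attribute_id)  -- one override per attribute per employee
);
INSERT INTO attribute VALUES (1, 'Is Hard Working', 1), (2, 'Is Overpaid', 0);
INSERT INTO employee VALUES (1), (2);
INSERT INTO employee_attribute VALUES (2, 2, 1);
""")

def effective_attributes(employee_id):
    """Return every attribute for one employee, using the override when present."""
    rows = conn.execute("""
        SELECT a.description, COALESCE(o.override_value, a.default_value)
        FROM attribute AS a
        LEFT JOIN employee_attribute AS o
               ON o.attribute_id = a.attribute_id AND o.employee_id = ?
        ORDER BY a.attribute_id
    """, (employee_id,)).fetchall()
    return {description: bool(value) for description, value in rows}
```

The LEFT JOIN keeps every attribute row, and COALESCE falls back to the default whenever no override row matches.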
Something to consider: how often will you be adding/changing/removing these booleans? If they're not likely to change then you'll probably like having them as individual columns. Many databases will probably pack them for you, especially if they're adjacent in the row, so they'll be stored efficiently.
If, on the other hand, you see yourself wanting to add/change/remove these booleans every once in a while you might be better served by something like (excuse PostgreSQL-isms and shoddy names):
CREATE TABLE employee_qualities (
    id SERIAL8 PRIMARY KEY,
    label TEXT UNIQUE
);
CREATE TABLE employee_employee_qualities (
    employee_id INT8 REFERENCES employee (id),
    quality_id INT8 REFERENCES employee_qualities (id),
    UNIQUE (employee_id, quality_id)
);
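With this layout a boolean is true when a linking row exists and false when it does not. A rough sketch of that check, using Python's sqlite3 module rather than PostgreSQL (so the column types differ), with an assumed minimal employee table and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);  -- assumed shape
CREATE TABLE employee_qualities (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE employee_employee_qualities (
    employee_id INTEGER REFERENCES employee (id),
    quality_id  INTEGER REFERENCES employee_qualities (id),
    UNIQUE (employee_id, quality_id)
);
INSERT INTO employee VALUES (1, 'Pat'), (2, 'Lee');
INSERT INTO employee_qualities VALUES (1, 'hard working'), (2, 'underpaid');
INSERT INTO employee_employee_qualities VALUES (1, 1), (1, 2), (2, 2);
""")

def has_quality(employee_id, label):
    """True if a linking row exists for this employee and quality label."""
    row = conn.execute("""
        SELECT EXISTS (
            SELECT 1
            FROM employee_employee_qualities AS eq
            JOIN employee_qualities AS q ON q.id = eq.quality_id
            WHERE eq.employee_id = ? AND q.label = ?
        )
    """, (employee_id, label)).fetchone()
    return bool(row[0])
```

Adding a new boolean then means inserting one row into employee_qualities instead of altering the employee table.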
A column for each is the best representation of your business requirements. You could combine a bunch of bools into a single int column and use bit masks to read the values, but this seems unnecessarily complex, and is something I would consider only if there was some high-end performance need for it.
Also, if you are using SQL Server, up to 8 bit columns get packed internally into a single byte, so the performance thing is sort of done for you already. (I don't know if other DBs do this.)
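For comparison, this is roughly what the bit mask alternative looks like on the application side. It is a sketch using Python's enum.IntFlag; the flag names mirror the question's columns and the bit assignments are arbitrary assumptions.

```python
from enum import IntFlag

class EmployeeFlags(IntFlag):
    # Each boolean gets its own bit; the assignments are arbitrary.
    HARD_WORKING = 1
    EFFICIENT = 2
    CRAZY = 4
    OVERWORKED = 8
    UNDERPAID = 16

# Value that would be written to the single integer column.
stored = int(EmployeeFlags.HARD_WORKING | EmployeeFlags.UNDERPAID)

# Reading it back: wrap the integer and test individual bits.
flags = EmployeeFlags(stored)
is_underpaid = bool(flags & EmployeeFlags.UNDERPAID)
is_crazy = bool(flags & EmployeeFlags.CRAZY)

# Clearing one flag before writing the value back.
flags_after = flags & ~EmployeeFlags.HARD_WORKING
```

Every reader and writer has to agree on the bit assignments, which is part of the extra complexity mentioned above.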
Related
For an existing database we are considering improving part of the database design.
25 tables have very similar structures, with about 90% identical columns and data types. The tables also change fairly frequently: for example, we may need to add 2 new columns to 7 of these 25 tables, and a few months later the 2 new columns may be required in 5 further tables, etc. We also get questions like how many rows in these tables have IsActive (see example below) = TRUE. This currently means creating 25 SQL statements, and the statements are much more complex than this simple example. It just feels wrong to query 25 tables and then combine the results.
One option we discussed would be to store all data in a master table. However in total this would mean having quite a wide table and quite a lot of NULL values.
A further idea we discussed is to keep the 25 tables and create a master view that combines them. The view would, however, need a lot of manual maintenance, and an update could be forgotten while the view still appears to work.
In database design one of the main concepts is: "For maximum flexibility, data is stored in columns, not in column names.", which leads us to the main question. Does anyone have experience in storing columns in a table? The columns actually contain filter criteria for business logic.
Here is an example:
Table 1: Business Rule 1
CustomerID (int) | IsPremiumCust (bool) | HasCreditCard (bool) | IsActive (bool) | OrderThreshold (int)
Table 2: Business Rule 2
CustomerID (int) | IsPremiumCust (bool) | HasCreditCard (bool) | IsActive (bool) | Discount (int)
There are a further 23 tables like these, all with more columns than in this example.
Suggestion: Criteria table
Criteria ID | Criteria | Data Type
1 | IsPremiumCust | bool
2 | HasCreditCard | bool
3 | IsActive | bool
4 | OrderThreshold | int
5 | Discount | int
Suggestion: Business Rule table
Business Rule ID | Name
1 | Business Rule 1
2 | Business Rule 2
Suggestion: Intersection table
CustomerID | Business Rule ID | Criteria ID | Criteria Value
------------------------------------------------------------
1 | 1 | 1 | TRUE
2 | 2 | 1 | FALSE
I know this doesn't really work, as the Criteria Value field could have different data types. However I hope someone might have had a similar situation and can think of a full solution for this question.
This would allow us to add criteria without having to keep changing many table structures.
It sounds like you should have one table that holds the core fields from your 25 common tables, with an additional field for the type of record that corresponds to the existing table names. Then you want one or a few supplemental tables that use the primary key of your new core table and store just the additional fields needed by each type of record. If you find yourself with a new set of columns that only apply to a handful of existing tables, that's fine: the new supplemental table only needs rows for the matching records in the core table. And when those columns expand to include more of your original tables, adding records to the supplemental table is easy. You can still build a master view from this, if you need it.
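A minimal sketch of that core-plus-supplemental layout, using Python's sqlite3 module. The table names (rule_core, rule_order_threshold, rule_discount) and the sample rows are assumptions made up for illustration; only the column ideas come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rule_core (
    core_id         INTEGER PRIMARY KEY,
    rule_type       TEXT    NOT NULL,   -- stands in for the old table name
    customer_id     INTEGER NOT NULL,
    is_premium_cust INTEGER NOT NULL,
    has_credit_card INTEGER NOT NULL,
    is_active       INTEGER NOT NULL
);
-- Supplemental tables hold only the columns that are not shared.
CREATE TABLE rule_order_threshold (
    core_id         INTEGER PRIMARY KEY REFERENCES rule_core (core_id),
    order_threshold INTEGER NOT NULL
);
CREATE TABLE rule_discount (
    core_id  INTEGER PRIMARY KEY REFERENCES rule_core (core_id),
    discount INTEGER NOT NULL
);
INSERT INTO rule_core VALUES
    (1, 'Business Rule 1', 101, 1, 1, 1),
    (2, 'Business Rule 1', 102, 0, 1, 0),
    (3, 'Business Rule 2', 101, 1, 1, 1);
INSERT INTO rule_order_threshold VALUES (1, 500), (2, 250);
INSERT INTO rule_discount VALUES (3, 10);
""")

# One statement answers the IsActive question across all rule types.
active_total = conn.execute(
    "SELECT COUNT(*) FROM rule_core WHERE is_active = 1").fetchone()[0]
active_by_type = dict(conn.execute(
    "SELECT rule_type, COUNT(*) FROM rule_core WHERE is_active = 1 GROUP BY rule_type"))
```

Queries over the shared columns never touch the supplemental tables, so adding a new supplemental table does not change them.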
I think the answer to your question is a well-designed property inheritance tree for the end entities. If the tree is optimized for the problem domain, you will have an efficient database schema without NULL values. The problem of the number of SQL statements can be solved with a suitable ORM.
I have a table that is used along the lines of this example:
A table for objects named tblObject needs to store some properties, so I can use one of these solutions:
Add a field for each property, each with its own data type.
Add just one foreign key relationship following the model below, where tblObjProp needs another field for storing the variant values:
+-----------+ +------------+ +-------------+
| tblObject | 1-----* | tblObjProp | *-----1 | tblProperty |
+-----------+ +------------+ +-------------+
I think the choice between these solutions depends on the number of properties. As I don't know that number, I need a reference, such as a recommendation on this site, for the count at which to switch, or any other recommendation.
I also need a recommendation for the data type of the value field in tblObjProp; in SQL Server I have sql_variant, nvarchar(max), and so on.
If the answers depend on the RDBMS, I need this for SQL Server 2012 and Oracle 11g.
If I have a table like Person with columns Id, FirstName, LastName, BirthDate, Comments, NID, ... there is a choice for each column:
Make it static in the Person table.
Add it to tblProperty.
What I want to know is whether I should always pick the second choice, or whether the first choice is better for columns such as Comments or NID.
If you know the number of properties, and this number doesn't change very often, then I would go for solution 1, as this allows correct indexing of the fields (and to my mind is correct database design).
But if the number of properties can vary for each row, and properties are added frequently, then I have seen solution 2 used (google "key-value pair"). The downside is that when you get a large number (millions, in my experience) of items for which you hold properties, this solution becomes very slow.
I'm building this gaming portal and I have some database concerns. Currently I have about 10 tables, but I think there will be more than 20 when I'm finished programming. Anyway, I want to create some sort of relationships between the different tables (somewhat like WordPress). That table will hold any relation that one row from table A has to a row in table B. And what I came up with is the following:
table relationships
| rs_id | rs_type | rs_alpha | rs_beta |
rs_id -> just an id
rs_type -> the type of relation
rs_alpha -> related table #1 and row id
rs_beta -> related table #2 and row id
examples:
| 1 | cover | games:153 | images:318 |
| 2 | tag | news:183 | tags:18 |
| 3 | group_admin | users:918 | group:75 |
...
This might just do it, but here are my concerns:
1. This table is going to grow so fast that in no time there might be over 100,000 rows which will slow the load time.
2. To extract info I'll have to explode every call which might slow down the load time.
3. I might divide table name from id (rs_alpha, rs_beta), yet that might also slow down the load time.
Thank you and I'm open to any other solutions that might be better than this one :)
If you have time you can download my db structure from here to see what it looks like:
demirevdesign.com/public/pcanvil.sql.gz
(The addon_ tables will become the relationships table)
As far as I understand, the relationship type itself defines the tables involved, so there is no need to store table names.
Also, if you refactor your schema and add a common parent table for all entities that might be involved in a relationship, you won't need to care about the table name at all; you just store the id from that new table.
Finally, a relationship always has a start date and may have an end date, so I'd suggest adding these attributes to the relationships table.
As to performance, it's hard to answer without seeing how you are going to query the table. I guess that, in general, partitioning by the relationship type column will be beneficial.
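To illustrate the first point, here is a sketch in which the relationship type implies both tables, so the two sides are stored as plain integer ids and nothing has to be split apart at read time. It uses Python's sqlite3 module; the column names rs_alpha_id and rs_beta_id and the index are assumptions, and the start/end dates are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relationships (
    rs_id       INTEGER PRIMARY KEY,
    rs_type     TEXT    NOT NULL,  -- e.g. 'cover' implies games -> images
    rs_alpha_id INTEGER NOT NULL,
    rs_beta_id  INTEGER NOT NULL
);
-- Lets lookups by type and left-hand id avoid scanning the whole table.
CREATE INDEX idx_rel_type_alpha ON relationships (rs_type, rs_alpha_id);
INSERT INTO relationships (rs_type, rs_alpha_id, rs_beta_id) VALUES
    ('cover', 153, 318),
    ('tag', 183, 18),
    ('group_admin', 918, 75);
""")

def related_ids(rs_type, alpha_id):
    """Ids on the right-hand side of a given relationship type."""
    rows = conn.execute(
        "SELECT rs_beta_id FROM relationships WHERE rs_type = ? AND rs_alpha_id = ?",
        (rs_type, alpha_id))
    return [row[0] for row in rows]
```

With integer columns and an index of this kind, a table of 100,000 rows is small for most relational databases.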
Here is the summary of my question; then I'll describe it in more detail:
I read about using the parametrized data modeling method instead of standard relational data modeling when building a semantic web application. I think we'll lose 90% of normalization if we use this method. If I want to design the database of my semantic web application, should I do it this way? What is the practical value?
In more detail:
I've read a lot of articles around this, in this book "Programming the semantic web - Toby Segaran, Colin Evans, and Jamie Taylor" at page 14 they tell us to use parametrized Data modeling to get Semantic Relationships instead of the standard relational database described by this example:
in the standard Relational Database :
Venue : [ ID(PK), Name, Address ]
Restaurant : [ ID(PK), VenueID(FK), CuisineID]
Bar : [ ID(PK), VenueID(FK), DJ?, Specialty ]
Hours : [ VenueID(FK), Day, Open, Close ]
For Semantic Relationships : One table only !!! Fully parameterized venues
Properties : [ VenueID,Field, Value ]
Example:
VenueID | Field | Value
1 | Cuisine | Deli
1 | Price | $
1 | Name | Deli Llama
1 | Address | Peachtree Rd
2 | Cuisine | Chinese
2 | Price | $$$
2 | Specialty Cocktail | Scorpion Bowl
2 | DJ? | No
2 | Name | Peking Inn
2 | Address | Lake St
3 | Live Music? | Yes
3 | Music Genre | Jazz
3 | Name | Thai Tanic
3 | Address | Branch Dr
Then the authors say:
Now each datum is described alongside the property that defines it. In doing this, we’ve
taken the semantic relationships that previously were inferred from the table and column
and made them data in the table. This is the essence of semantic data modeling:
flexible schemas where the relationships are described by the data itself.
If I want to design the database of my semantic web application, should I do it this way? What is the practical value?
What you lose in immediate clarity, you gain in flexibility. Notice that with the more parametrized approach you gain the ability to easily add fields without altering any tables. This allows you to give different fields to different venues as it suits your application. By extension, it also makes the web application easier to extend down the road, whether by you or by future maintainers and contributors (if you intend to release it).
Just be careful when it comes to performance. Don't adopt a fully parametrized design when it is easier to use a standard relational design. Let's say, for a moment, you have two different users tables, one relational and the other parametrized:
Table: users_relational
+---------+----------+------------------+----------+
| user_id | username | email | password |
+---------+----------+------------------+----------+
| 1 | Sam | sam@example.com | ******** |
| 2 | John | john@example.com | ******** |
| 3 | Jane | jane@example.com | ******** |
+---------+----------+------------------+----------+
Table: users_parametrized
+---------+----------+------------------+
| user_id | field | value |
+---------+----------+------------------+
| 1 | username | Sam |
| 1 | email | sam@example.com |
| 1 | password | ******** |
| 2 | username | John |
| 2 | email | john@example.com |
| 2 | password | ******** |
| 3 | username | Jane |
| 3 | email | jane@example.com |
| 3 | password | ******** |
+---------+----------+------------------+
Now you want to select a single user. With your relational table, you will only select one row, while your parametrized version will select the number of rows that there are fields associated with that user, in this case 3.
The next issue is searchability (at times). Say you have that same users table from the example above, but instead of knowing the user ID, you only know the username. You may be using two queries, one to find the user id and the other to get the data associated with the user.
The last drawback shows up when selecting only a few fields at a time. Taking the users tables example again, we can limit the number of fields easily with the relational one:
SELECT username, email FROM users_relational WHERE user_id = 2
We should get a single result with two columns.
Now, for the parametrized table:
SELECT field, value FROM users_parametrized WHERE user_id = 2 AND field IN('username','email')
It's a little more verbose and will become less readable than the first one, especially if you start taking on more and more fields to select.
Additionally, the parametrized version will tend to be slower for a few reasons. It now has to do text comparisons on the varchar in the field column, instead of matching on a single, numerically indexed user_id. With the first query, the database knows when to stop looking for the record because you're selecting by a primary key. In the parametrized table as shown, user_id is no longer unique and nothing is declared as a key, so unless you add an index or a composite key covering user_id and field, the database must look through all the records.
This leads me into the final real difference (as far as your DBMS sees it). The parametrized table as shown has no primary key, which (as you saw above) can be a performance issue, especially if you already have a considerable number of records. For something like a users table where you can have thousands of records, your row count would be that number times 3 (as we have three non-user_id fields) in this case alone. That's a lot of data for the database to search through.
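The parametrized lookups discussed above can be sketched with Python's sqlite3 module. Two things here are assumptions that go beyond the tables shown: a composite primary key on (user_id, field), which gives the database something to index, and a conditional-aggregation query that folds the rows back into one row per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_parametrized (
    user_id INTEGER NOT NULL,
    field   TEXT    NOT NULL,
    value   TEXT,
    PRIMARY KEY (user_id, field)   -- assumed; not part of the table shown above
);
INSERT INTO users_parametrized VALUES
    (1, 'username', 'Sam'),  (1, 'email', 'sam@example.com'),
    (2, 'username', 'John'), (2, 'email', 'john@example.com');
""")

# Fold the field/value rows for one user back into a single row.
row = conn.execute("""
    SELECT MAX(CASE WHEN field = 'username' THEN value END) AS username,
           MAX(CASE WHEN field = 'email'    THEN value END) AS email
    FROM users_parametrized
    WHERE user_id = ?
""", (2,)).fetchone()

# Find a user by username and fetch all of their fields in one statement
# by joining the table to itself.
by_name = conn.execute("""
    SELECT p.field, p.value
    FROM users_parametrized AS u
    JOIN users_parametrized AS p ON p.user_id = u.user_id
    WHERE u.field = 'username' AND u.value = ?
    ORDER BY p.field
""", ("Sam",)).fetchall()
```

Both queries work, but each selected field costs another CASE expression or another row to reassemble, which is the verbosity described above.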
There are quite a few things to consider when designing your application. Don't be afraid to mix your database with parametrized and relational style - it just has to make sense practically. In the case you gave, it makes perfect sense to do so; in the case I displayed, it would be pointless.
It is possible to stay fully relational while pursuing the intent of storing data in a parameterized fashion. The following is a greatly oversimplified demonstration, but should suffice to show the main tricks that are needed -- in a nutshell, additional levels of abstraction, some surrogate primary keys, and some tables with composite primary keys. I will leave out exact description of foreign key constraints assuming the reader can grasp the obvious relations between tables below.
Your first table is only to establish the entities you want to store information about, and a key to look up what sorts of information will be stored:
entity_id | entity_type
---------------------------
1 | lawn mower
2 | toothbrush
3 | bicycle
4 | restaurant
5 | person
The next table relates entity type to the fields you wish to store for each entity type:
entity_type | attribute
------------------------
lawn mower | horsepower
lawn mower | retail price
lawn mower | gas_or_electric
lawn mower | ...etc
toothbrush | bristle stiffness
toothbrush | weight
toothbrush | head size
toothbrush | retail price
toothbrush | ...etc
person | name
person | email
person | birth date
person | ...etc
This is expandable to as many fields as you like for each entity type. It's still relational; this table does have a primary key, it's just a composite key composed of both columns.
This example is oversimplified for brevity; in actual practice you have to confront the namespacing issues with attributes and you probably want certain attribute names to be per-entity-type in case the same name means something different on an entirely different kind of entity. Use a surrogate primary key for the attributes in order to solve the namespacing issue, if you don't mind the decrease in readability when looking directly at the tables.
Meanwhile, and opposite of the preceding point, it's useful to make common and unambiguous attributes (such as "weight in grams" or "retail price in USD") available for reuse across multiple entity types. To handle this, add a level of abstraction between attributes and entity types. Make a table of "attribute sets", with each set linked to 1..n attributes. Then each entity type in the table above would be linked not directly to attributes, but to one or more attribute sets.
You'll need to either guarantee that attribute sets do not overlap in what attributes they point to, or create a means of resolving conflicts by hierarchy, composition, set union, or whatever fits your needs.
So at this point a lookup for a particular entity goes as follows. From the entity id we get the entity type. From entity type we get 1..n attribute sets, which yield a resulting attribute set that is held by the entity. Finally there is the big table with the actual data in it as follows:
entity_id | attribute_id | value
---------------------------------------
923 | 1049272 | green
923 | 1049273 | 206.55
924 | 1049274 | 843-219-2862
924 | 1049275 | Smith
929 | 1049276 | soft
929 | 1049277 | ...etc
As with all of these tables, this one has a primary key, in this case composed of the entity_id and attribute_id columns. The values are stored in a plain-text column without units. The units are stored in a separate table linking attributes to units. More tables can be established if you need to get more specific on that; you can set up additional levels of abstraction to establish an "attribute type" system similar to the entity type system described above.
If needed, you can go as far as storing relationships such as "attribute X is numerically convertible to attribute Y by the following formula", for numerical attributes. Or for non-numerical attributes you can establish equivalence tables to manage alternate spellings or formats for the allowed values of an attribute.
As you can imagine, the farther you go with your "attribute types and units" system, and the more you use that additional machinery in computation, the slower this all will be. In the worst case you're looking at many joins. But that problem can be addressed with caching and views, if your situation allows you to make tradeoffs such as slowing write speed to gain a great increase in read speed. Also, many of your queries to the database will be in situations where you already know what entity type you're working with at the moment and what its resulting attributes are and their types; so you only have to grab the literal values out of the entity/attribute/value table, and that is plenty fast.
In conclusion, hopefully I have shown how you can get as parameterized as you wish while remaining fully relational. It just requires more tables for more levels of abstraction than some of the simpler approaches do; yet it avoids the disadvantages of the "one-big-table" style. This style of entity>type>attribute>value storage is powerful, flexible, and can be extended as far as you need.
And thanks to a relational/normalized table setup, you can do all sorts of reorganizing along the way as your entity schema evolves, without losing data. The additional levels of abstraction allow you to re-parent attributes from one attribute set to another, change their names if needed, and change which sets of attributes an entity type makes use of, without losing stored values, as long as you write appropriate migrations. The other day I realized I needed to store a certain product attribute on a per-brand basis instead of per-product, and was able to make the schema change in five minutes with only a couple of updated rows in the database. In many other setups, particularly in a one-big-table setup, it could have been a lot more work, requiring as much as one or more updated rows per entity affected by the change.
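A stripped-down version of the entity > type > attribute > value layout from this answer might look like the following sketch, written with Python's sqlite3 module. It skips the attribute sets and units tables entirely, and the table names, ids, and sample values are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL
);
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,        -- surrogate key
    entity_type  TEXT NOT NULL,
    name         TEXT NOT NULL,
    UNIQUE (entity_type, name)               -- names are scoped per entity type
);
CREATE TABLE entity_value (
    entity_id    INTEGER NOT NULL REFERENCES entity (entity_id),
    attribute_id INTEGER NOT NULL REFERENCES attribute (attribute_id),
    value        TEXT,
    PRIMARY KEY (entity_id, attribute_id)    -- composite primary key
);
INSERT INTO entity VALUES (1, 'lawn mower'), (2, 'toothbrush');
INSERT INTO attribute VALUES
    (10, 'lawn mower', 'horsepower'),
    (11, 'lawn mower', 'retail price'),
    (20, 'toothbrush', 'bristle stiffness');
INSERT INTO entity_value VALUES (1, 10, '6.5'), (1, 11, '206.55'), (2, 20, 'soft');
""")

def describe(entity_id):
    """Entity id -> entity type -> attributes -> stored values."""
    rows = conn.execute("""
        SELECT a.name, v.value
        FROM entity AS e
        JOIN attribute AS a ON a.entity_type = e.entity_type
        LEFT JOIN entity_value AS v
               ON v.entity_id = e.entity_id AND v.attribute_id = a.attribute_id
        WHERE e.entity_id = ?
        ORDER BY a.attribute_id
    """, (entity_id,)).fetchall()
    return dict(rows)

# The composite key rejects a second value for the same entity and attribute.
try:
    conn.execute("INSERT INTO entity_value VALUES (1, 10, '7.0')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Renaming an attribute here touches one row in the attribute table and none of the stored values, which is the kind of cheap reorganization described above.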
I'm thinking about how to represent a complex structure in a SQL Server database.
Consider an application that needs to store details of a family of objects, which share some attributes, but have many others not common. For example, a commercial insurance package may include liability, motor, property and indemnity cover within the same policy record.
It is trivial to implement this in C#, etc, as you can create a Policy with a collection of Sections, where Section is inherited as required for the various types of cover. However, relational databases don't seem to allow this easily.
I can see that there are two main choices:
Create a Policy table, then a Sections table, with all the fields required, for all possible variations, most of which would be null.
Create a Policy table and numerous Section tables, one for each kind of cover.
Both of these alternatives seem unsatisfactory, especially as it is necessary to write queries across all Sections, which would involve numerous joins, or numerous null-checks.
What is the best practice for this scenario?
@Bill Karwin describes three inheritance models in his SQL Antipatterns book, when proposing solutions to the SQL Entity-Attribute-Value antipattern. This is a brief overview:
Single Table Inheritance (aka Table Per Hierarchy Inheritance):
Using a single table as in your first option is probably the simplest design. As you mentioned, many attributes that are subtype-specific will have to be given a NULL value on rows where these attributes do not apply. With this model, you would have one policies table, which would look something like this:
+------+---------------------+----------+----------------+------------------+
| id | date_issued | type | vehicle_reg_no | property_address |
+------+---------------------+----------+----------------+------------------+
| 1 | 2010-08-20 12:00:00 | MOTOR | 01-A-04004 | NULL |
| 2 | 2010-08-20 13:00:00 | MOTOR | 02-B-01010 | NULL |
| 3 | 2010-08-20 14:00:00 | PROPERTY | NULL | Oxford Street |
| 4 | 2010-08-20 15:00:00 | MOTOR | 03-C-02020 | NULL |
+------+---------------------+----------+----------------+------------------+
\------ COMMON FIELDS -------/ \----- SUBTYPE SPECIFIC FIELDS -----/
Keeping the design simple is a plus, but the main problems with this approach are the following:
When it comes to adding new subtypes, you would have to alter the table to accommodate the attributes that describe these new objects. This can quickly become problematic when you have many subtypes, or if you plan to add subtypes on a regular basis.
The database will not be able to enforce which attributes apply and which don't, since there is no metadata to define which attributes belong to which subtypes.
You also cannot enforce NOT NULL on attributes of a subtype that should be mandatory. You would have to handle this in your application, which in general is not ideal.
Concrete Table Inheritance:
Another approach to tackle inheritance is to create a new table for each subtype, repeating all the common attributes in each table. For example:
--// Table: policies_motor
+------+---------------------+----------------+
| id | date_issued | vehicle_reg_no |
+------+---------------------+----------------+
| 1 | 2010-08-20 12:00:00 | 01-A-04004 |
| 2 | 2010-08-20 13:00:00 | 02-B-01010 |
| 3 | 2010-08-20 15:00:00 | 03-C-02020 |
+------+---------------------+----------------+
--// Table: policies_property
+------+---------------------+------------------+
| id | date_issued | property_address |
+------+---------------------+------------------+
| 1 | 2010-08-20 14:00:00 | Oxford Street |
+------+---------------------+------------------+
This design will basically solve the problems identified for the single table method:
Mandatory attributes can now be enforced with NOT NULL.
Adding a new subtype requires adding a new table instead of adding columns to an existing one.
There is also no risk that an inappropriate attribute is set for a particular subtype, such as the vehicle_reg_no field for a property policy.
There is no need for the type attribute as in the single table method. The type is now defined by the metadata: the table name.
However this model also comes with a few disadvantages:
The common attributes are mixed with the subtype specific attributes, and there is no easy way to identify them. The database will not know either.
When defining the tables, you would have to repeat the common attributes for each subtype table. That's definitely not DRY.
Searching for all the policies regardless of the subtype becomes difficult, and would require a bunch of UNIONs.
This is how you would have to query all the policies regardless of the type:
SELECT date_issued, other_common_fields, 'MOTOR' AS type
FROM policies_motor
UNION ALL
SELECT date_issued, other_common_fields, 'PROPERTY' AS type
FROM policies_property;
Note how adding new subtypes would require the above query to be modified with an additional UNION ALL for each subtype. This can easily lead to bugs in your application if this operation is forgotten.
Class Table Inheritance (aka Table Per Type Inheritance):
This is the solution that @David mentions in the other answer. You create a single table for your base class, which includes all the common attributes. Then you would create specific tables for each subtype, whose primary key also serves as a foreign key to the base table. Example:
CREATE TABLE policies (
    policy_id int PRIMARY KEY,
    date_issued datetime
    -- // other common attributes ...
);
CREATE TABLE policy_motor (
    policy_id int PRIMARY KEY,
    vehicle_reg_no varchar(20),
    -- // other attributes specific to motor insurance ...
    FOREIGN KEY (policy_id) REFERENCES policies (policy_id)
);
CREATE TABLE policy_property (
    policy_id int PRIMARY KEY,
    property_address varchar(20),
    -- // other attributes specific to property insurance ...
    FOREIGN KEY (policy_id) REFERENCES policies (policy_id)
);
This solution solves the problems identified in the other two designs:
Mandatory attributes can be enforced with NOT NULL.
Adding a new subtype requires adding a new table instead of adding columns to an existing one.
No risk that an inappropriate attribute is set for a particular subtype.
No need for the type attribute.
Now the common attributes are not mixed with the subtype specific attributes anymore.
We can stay DRY, finally. There is no need to repeat the common attributes for each subtype table when creating the tables.
Managing an auto incrementing id for the policies becomes easier, because this can be handled by the base table, instead of each subtype table generating them independently.
Searching for all the policies regardless of the subtype now becomes very easy: No UNIONs needed - just a SELECT * FROM policies.
I consider the class table approach as the most suitable in most situations.
The names of these three models come from Martin Fowler's book Patterns of Enterprise Application Architecture.
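As a rough, runnable illustration of the class table approach, here is a sketch using Python's sqlite3 module. It reuses the table names from this answer with SQLite column types and a few of the sample rows; the detail query with LEFT JOINs is an assumption about how one might read subtype columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE policies (
    policy_id   INTEGER PRIMARY KEY,
    date_issued TEXT NOT NULL
);
-- Each subtype table shares its primary key with the base table.
CREATE TABLE policy_motor (
    policy_id      INTEGER PRIMARY KEY REFERENCES policies (policy_id),
    vehicle_reg_no TEXT NOT NULL
);
CREATE TABLE policy_property (
    policy_id        INTEGER PRIMARY KEY REFERENCES policies (policy_id),
    property_address TEXT NOT NULL
);
INSERT INTO policies VALUES (1, '2010-08-20 12:00:00'), (2, '2010-08-20 14:00:00');
INSERT INTO policy_motor VALUES (1, '01-A-04004');
INSERT INTO policy_property VALUES (2, 'Oxford Street');
""")

# All policies regardless of subtype: the base table alone is enough.
all_policies = conn.execute(
    "SELECT policy_id FROM policies ORDER BY policy_id").fetchall()

# Subtype-specific columns are pulled in with LEFT JOINs only when needed.
details = conn.execute("""
    SELECT p.policy_id, m.vehicle_reg_no, pr.property_address
    FROM policies AS p
    LEFT JOIN policy_motor    AS m  ON m.policy_id  = p.policy_id
    LEFT JOIN policy_property AS pr ON pr.policy_id = p.policy_id
    ORDER BY p.policy_id
""").fetchall()
```

Note that the subtype columns can be declared NOT NULL here, which the single table design cannot do.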
The 3rd option is to create a "Policy" table, then a "SectionsMain" table that stores all of the fields that are in common across the types of sections. Then create other tables for each type of section that only contain the fields that are not in common.
Deciding which is best depends mostly on how many fields you have and how you want to write your SQL. They would all work. If you have just a few fields then I would probably go with #1. With "lots" of fields I would lean towards #2 or #3.
In addition to Daniel Vassallo's solution, if you use SQL Server 2016+, there is another solution that I have used in some cases without considerable loss of performance.
You can create a single table with only the common fields and add a single column holding a JSON string that contains all the subtype-specific fields.
I have tested this design for managing inheritance and I am very happy with the flexibility it gives me in the related application.
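The shape of that design can be sketched as follows, using Python's sqlite3 and json modules instead of SQL Server. The details column name and the load_policy helper are hypothetical, and the JSON is parsed in application code here; SQL Server 2016+ can also read such a column in SQL with functions like JSON_VALUE.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE policies (
        policy_id   INTEGER PRIMARY KEY,
        date_issued TEXT NOT NULL,
        type        TEXT NOT NULL,
        details     TEXT NOT NULL      -- JSON string with the subtype-specific fields
    )
""")
conn.executemany(
    "INSERT INTO policies VALUES (?, ?, ?, ?)",
    [
        (1, "2010-08-20 12:00:00", "MOTOR",
         json.dumps({"vehicle_reg_no": "01-A-04004"})),
        (3, "2010-08-20 14:00:00", "PROPERTY",
         json.dumps({"property_address": "Oxford Street"})),
    ],
)

def load_policy(policy_id):
    """Merge the common columns with the decoded subtype-specific fields."""
    date_issued, policy_type, details = conn.execute(
        "SELECT date_issued, type, details FROM policies WHERE policy_id = ?",
        (policy_id,)).fetchone()
    return {"date_issued": date_issued, "type": policy_type, **json.loads(details)}
```

The trade-off is that the database no longer knows the subtype fields as columns, so constraints such as NOT NULL on them have to be checked elsewhere.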
With the information provided, I'd model the database to have the following:
POLICIES
POLICY_ID (primary key)
LIABILITIES
LIABILITY_ID (primary key)
POLICY_ID (foreign key)
PROPERTIES
PROPERTY_ID (primary key)
POLICY_ID (foreign key)
...and so on, because I'd expect there to be different attributes associated with each section of the policy. Otherwise, there could be a single SECTIONS table and in addition to the policy_id, there'd be a section_type_code...
Either way, this would allow you to support optional sections per policy...
I don't understand what you find unsatisfactory about this approach - this is how you store data while maintaining referential integrity and not duplicating data. The term is "normalized"...
Because SQL is SET based, it's rather alien to procedural/OO programming concepts & requires code to transition from one realm to the other. ORMs are often considered, but they don't work well in high volume, complex systems.
Another way to do it is to use the INHERITS clause (a PostgreSQL feature). For example:
CREATE TABLE person (
    id int,
    name varchar(20),
    CONSTRAINT pessoa_pkey PRIMARY KEY (id)
);
CREATE TABLE natural_person (
    social_security_number varchar(11),
    CONSTRAINT pessoaf_pkey PRIMARY KEY (id)
) INHERITS (person);
CREATE TABLE juridical_person (
    tin_number varchar(14),
    CONSTRAINT pessoaj_pkey PRIMARY KEY (id)
) INHERITS (person);
Thus it's possible to define an inheritance relationship between tables.
Alternatively, consider using a document database (such as MongoDB), which natively supports rich data structures and nesting.
I lean towards method #1 (a unified Section table), for the sake of efficiently retrieving entire policies with all their sections (which I assume your system will be doing a lot).
Further, I don't know what version of SQL Server you're using, but in 2008+ Sparse Columns help optimize storage in situations where many of the values in a column will be NULL.
Ultimately, you'll have to decide just how "similar" the policy sections are. Unless they differ substantially, I think a more-normalized solution might be more trouble than it's worth... but only you can make that call. :)