Database design to support dynamic entities

Database design to support dynamic entities - sql-server

OK, I don't know whether this question belong to this place, but you will suggest me if I'm wrong.
I have some entities which has almost same attributes, differences is in maybe 2-3 columns.
Because of those different columns, I can't create one table with columns that are union of attributes of every entity, because new entity type will require changing table design adding new columns specific to that entity type.
Instead, currently working design is that every specific entity has own table.
But, if new type of entity come on scene, I must create new table, which is totally bad idea.
How can I create one table which consists shared attributes for each type of entity, and some additional mechanism to evidence entity-unique attributes?
So, idea is to easy add new types of objects, without changing database design, configuring only part that deal with unique columns.
P.S. Maybe I'm not clear, but I will add more description if is it needed.

I had a design like that once. What I did was I created a table that housed all the shared properties. Then, I had separate tables for the distinct values. I used joins to match a specific entity to its shared table row. I had less than 10, so my views that used unions I just updated when I added a new entity. But, if you used a naming convention, you could write stored procs that find the table names dynamically and do the unions and joins on the fly. In my case, I used a base class and specific classes to make a custom data layer.
Another possibility is to have a generic table that's basically name/value pairs and a table the represents your shared properties. By joining the tables together, you could have any number of entity specific properties for your entities. It's not very efficient and the SQL would get weird, but I've seen it done.

One solution is to store the common parts in one table, and the specific parts in tables specific to that entity.
eg: To have a set of people, some of whom are managers...
Person Table
PersonID
PersonName
Manager Table
ManagerID
PersonID
DepartmentManaged
As soon as you go down the path of having one table with variable field meanings - effectively an Entity Attribute Value design - you find yourself in querying hell.

Perhaps not the best or most academic, but what about this kind of "open structure" ?
MainTable: all common fields
SpecialProperties: extra properties, as required
- MainRecordId (P, F->MainTable)
- PropertyName (P)
- PropertyText
- PropertyValue (for numeric values)

Related

the correct structure relation between student vs teacher comparing to user table

What is the correct database structure design for user table if I have instructor vs student?
Is it correct to create each one in separate table and the id is the user table
Like this?:
Or creating new flag field (1 or 2) to define the students from the instructors like this?:
I know both will work, but I'm asking to get the most professional answer for this problem.
For me i'm working on Laravel and creating a relations there is very easy.

The first option - separate tables for each subtype of user - is recommended when you have subtype-specific attributes, relationships or constraints; and a known fixed set of subtypes.
The second option - a type indicator column - is recommended when you don't have any subtype-specific attributes, relationships or constraints; and works better for user-managed subtypes.
It doesn't matter whether you have overlapping or disjoint subtypes. Either can be recorded in separate tables; or overlapping subtypes can be indicated via multiple boolean fields and disjoint via a single field.

Datomic table model

I have an application that requires a database containing a set of products where each product can have a set of tables. The end-user should be able to add new products and define new tables for a product. So each table has a set of columns that are specified by the user. The user can then fill the tables with rows of data. Each table belongs to exactly one product.
The end-user should also be able to view the tables as they were at a specific point in time (at a certain transaction).
How would I go about making a schema for this in Datomic so that querying it would be as efficient as possible?

I would go with 4 entity types: products, tables, columns, and rows.
The relationship between products and tables is best handled by a :table/product to-one ref attribute, but a :product/tables to-many component ref attribute could also work (the latter does not enforce the one-to-many relationship).
Likewise, I would use either a :column/table or :table/columns attribute. I would also have a :column/name string attribute and maybe a :column/type enumerated attribute.
The hardest part is to model rows.
One tempting solution is to just create an attribute per column - I actually think it's bad idea, Datomic attributes are not intended for such a dynamic use. In particular, schema attributes are stored in a cache on the Peer that's not meant to grow big. (I may be wrong about this, so it'd be nice if someone in the Datomic team could confirm.)
Instead, I would have a few dozens reusable :row/cell-0, :row/cell-1, :row/cell-2, etc. 'cell position' attributes, that are shared across all tables. Each actual column would be mapped to a at creation time by a to-one :column/position attribute.
If the rows can have several data types, it's a bit more difficult, you'd have to basically make an attribute for each (type,position) pair.
Then each row basically consist of a :row/table attribute and the above cell position attributes.
Here's a Datalog query that would let you read the whole table
[:find ?row ?column-name ?val :in $ ?table :where
[?column :column/table ?table]
[?row :row/table ?table]
[?row ?pos ?val]
[?column :column/position ?pos]
[?column :column/name ?column-name]]
Note that all of the above is only useful if you want to query the table with Datalog directly against your Datomic db. But it can be also completely fine to serialize your tables and store them as blobs - especially if they're small; later, you pull out the blob, deserialize it, then you can query with Datalog too. And if tables are to coarse for this use, maybe you can do it with rows.

Dynamic columns in database tables vs EAV

I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.
For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?
I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.
Could I generate the tables dynamically instead? I see that Sqlite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(

I will assume that SQLite (or another relational DBMS) is a requirement.
EAVs
I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.
Lets say that you design a datamodel with three tables: entities, attributes, and _entities_attributes_:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes
(entity_id INTEGER, attribute_id INTEGER, value TEXT,
PRIMARY KEY(entity_id, attribute_id));
In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate to your cars (brand, model, color, ...) and its type (text, number, date, ...), and the _entity_attributes_ will hold the values of the attributes for a given entity (for example "red").
Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).
INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and its relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations.
When you want to perform an UPDATE, you will need to know what is the entity that you want to update, and update the desired attribute joining with the relation between the entity and its attributes.
When you want to perform a DELETE, you will also need to need to know what is the entity you want to delete, delete its attributes, delete the relation between your entity and its attributes and then delete the entity.
But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.
Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be
SELECT brand, model FROM cars;
If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3 table join. The query will perform really bad.
Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?
Indexes are another thing that you could not benefit of (or at least not as much as it would be desirable), and they are very neccesary as the data stored grows.
As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
The only advantage of using EAVs is that you can store virtually everything with the same model, but is like having a box full of stuff where you want to find one concrete, small item.
Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:
http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
Dynamic columns in database tables
On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example create a car table with the basic attributes that you know that you will use and then add columns dynamically when you need them (for example the number of exhausts).
The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.
This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11
BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to ensure that any of your cars have this property to prevent from losing data before deleting the column. But this, of course depends on your concrete scenario.
Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).
Another option (pseudo-generic model)
A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.
Lets say that you create a table like this:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
name TEXT,
type TEXT,
attribute1 TEXT,
attribute1 TEXT,
...
attributeN TEXT
);
In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).
If you need to know what the attribute37 stands for when type is "red", you would need to add another table that relates the types and attributes with the description of the attributes.
And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).
In this case, the attributes are always stored as TEXT (as in EAVs) with it's disadvantages.
But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
Hope it helps.
Follow up from the comments:
With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.
As you say, most of your columns will be empty for most of your records, and you will need to index all of your colums for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
If you implement a EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:
SELECT e.name, a1.value brand, a2.value model, a3.value color
FROM entities e
LEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')
LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')
LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');
As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:
SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;
Also, take into account the potential size of your _entity_attributes_ table. If you can potentially have 40 attributes for each entity, lets say that you have 20 not null for each of them. If you have 10,000 entities, your _entity_attributes_ table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.

It all depends on the way in which your application needs to reason about the data.
If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.
For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.
However, you could achieve something like this using XML, XPath etc.
If your application has a set on data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian Jira's bug tracking solution has JQL, a SQL-like language for querying bugs.
EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.

For your use case, a document oriented database like MongoDB would do great.

Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns into new tables with FK indexes to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies)
Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
Slower than normalized, but not as slow as EAV.
Adding new extended attributes becomes a problem of adding a new table.
Harder than EAV, easier/faster than modifying table schema.
Deleting attributes becomes a problem of dropping whole tables.
Easier/faster than modifying table schema.
These new attributes can be strongly typed.
As good as modifying table schema, faster than EAV or generic columns.
The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That and I'm not sure how much better LEFT OUTER JOIN's actually perform than EAV table joins. It's definitely closer to EAV join performance than normalized table performance.
If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.

I would try EAV.
Adding columns based on user input doesn't sounds nice to me and you can quickly run out of capacity. Queries on very flat table can also be a problem. Do you want to create hundreds of indexes?
Instead of writing every thing to one table, I would store as many as possible common properties (price, name , color, ...) in the main table and those less common properties in an "extra" attributes table. You can always balance them later with a little effort.
EAV can performance well for small to middle sized data set. Since you want to use SQLlite, I guess it's not be a problem.
You may also want to avoid "over" normalizing your data. With the cheap storage
we currently have, you can use one table to store all "Extra" attributes, instead of two:
ent_id, ent_name, ...
ent_id, attr_name, attr_type, attr_value ...
People against EAV will say its performance is poor on large database. It's sure that it won't performance as well as normalized structure but you don't want to change structure on a 3TB table either.

I have a low quality answer, but possible, that came from HTML tags that are like : <tag width="10px" height="10px" ... />
In this dirty way you will have just one column as a varchar(max) for all properties say it Props column and you will store data in it like this:
Props
------------------------------------------------------------
Model:Model of car1|Year:2010|# of doors:4
Model:Model of car2|NewProp1:NewValue1|NewProp2:NewValue2
In this way all works will go to the programming code in business layer with using some functions like concatCustom that get an array and return a string and unconcatCustom that get a string and return an array.
For more validity of special characters like ':' and '|', I suggest '#:#' and '#|#' or something more rare for splitter part.
In a similar way you can use a text or binary field and store an XML data in the column.

Many tables to a single row in relational database

Consider we have a database that has a table, which is a record of a sale. You sell both products and services, so you also have a product and service table.
Each sale can either be a product or a service, which leaves the options for designing the database to be something like the following:
Add columns for each type, ie. add Service_id and Product_id to Invoice_Row, both columns of which are nullable. If they're both null, it's an ad-hoc charge not relating to anything, but if one of them is satisfied then it is a row relating to that type.
Add a weird string/id based system, for instance: Type_table, Type_id. This would be a string/varchar and integer respectively, the former would contain for example 'Service', and the latter the id within the Service table. This is obviously loose coupling and horrible, but is a way of solving it so long as you're only accessing the DB from code, as such.
Abstract out the concept of "something that is chargeable" for with new tables, of which Product and Service now are an abstraction of, and on the Invoice_Row table you would link to something like ChargeableEntity_id. However, the ChargeableEntity table here would essentially be redundant as it too would need some way to link to an abstract "backend" table, which brings us all the way back around to the same problem.
Which way would you choose, or what are the other alternatives to solving this problem?

What you are essentially asking is how to achieve polymorphism in a relational database. There are many approaches (as you yourself demonstrate) to this problem. One solution is to use "table per class" inheritance. In this setup, there will be a parent table (akin to your "chargeable item") that contains a unique identifier and the fields that are common to both products and services. There will be two child tables, products and goods: Each will contain the unique identifier for that entity and the fields specific to it.
One benefit to this approach over others is you don't end up with one table with many nullable columns that essentially becomes a dumping ground to describe anything ("schema-less").
One downside is as your inheritance hierarchy grows, the number of joins needed to grab all the data for an entity also grows.

I believe it depends on use case(s).
You could put the common columns in one table and put product and service specific columns in its own tables.Here the deal is that you need to join stuff.
Else if you maintain two separate tables, one for Product and another for Sale. You use application logic to determine which table to insert into. And getting all sales will essentially mean , union of getting all products and getting all sale.
I would go for approach 2 personally to avoid joins and inserting into two tables whenever a sale is made.

What's more readable naming conventions for lookup tables?

We always name lookup tables - such as Countries,Cities,Regions ... etc - as below :
EntityName_LK OR LK_EntityName ( Countries_LK OR LK_Countries )
But I ask if any one have more better naming conversions for lookup tables ?
Edit:
We think to make postfix or prefix to solve like a conflict :
if we have User tables and lookup table for UserTypes (ID-Name) and we have a relation many to many between User & UserTypes that make us a table which we can name it like Types_For_User that may make confusion between UserTypes & Types_For_User So we like to make lookup table UserTypes to be like UserTypesLK to be obvious to all

Before you decide you need the "lookup" moniker, you should try to understand why you are designating some tables as "lookups" and not others. Each table should represent an entity unto itself.
What happens when a table that was designated as a "lookup" grows in scope and is no longer considered a "lookup"? You are either left with changing the table name which can be onerous or leaving it as is and having to explain to everyone that a given table isn't really a "lookup".
A common scenario mentioned in the comments related to a junction table. For example, suppose a User can have multiple "Types" which are expressed in a junction table with two foreign keys. Should that table be called User_UserTypes? To this scenario, I would first say that I prefer to use the suffix Member on the junction table. So we would have Users, UserTypes, UserTypeMembers. Secondly, the word "type" in this context is quite generic. Does a UserType really mean a Role? The term you use can make all the difference. If UserTypes are really Roles, then our table names become Users, Roles, RoleMembers which seems quite clear.

Here are two concerns for whether to use a prefix or suffix.
In a sorted list of tables, do you want the LK tables to be together or do you want all tables pertaining to EntityName to appear together
When programming in environments with auto-complete, are you likely to want to type "LK" to get the list of tables or the beginning of EntityName?
I think there are arguments for either, but I would choose to start with EntityName.

Every table can become a lookup table.
Consider that a person is a lookup in an Invoice table.
So in my opinion, tables should just be named the (singular) entity name, e.g. Person, Invoice.
What you do want is a standard for the column names and constraints, such as
FK_Invoice_Person (in table invoice, link to person)
PersonID or Person_ID (column in table invoice, linking to entity Person)
At the end of the day, it is all up to personal preference (if you can get away with dictating it) or team standards.
updated
If you have lookups that pertain only to entities, like Invoice_Terms which is a lookup from a list of 4 scenarios, then you could name it as Invoice_LK_Terms which would make it appear by name grouped under Invoice. Another way is to have a single lookup table for simple single-value lookups, separated by the function (table+column) it is for, e.g.
Lookups
Table | Column | Value

There is only one type of table and I don't believe there is any good reason for calling some tables "lookup" tables. Use a naming convention that works equally for every table.

One area where table naming conventions can help is data migration between environments. We often have to move data in lookup tables (which constrain values which may appear in other tables) along with schema changes, as these allowed value lists change. Currently we don't name lookup tables differently, but we are considering it to prevent the migration guy asking "which tables are lookup tables again?" every time.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight