I have a general question about relational database design. I have a large list of objects from a table. These objects are classified with three descriptor classes. There is distributor1, distributor2 and distributor3. To describe one object one always builds a triplet of these descriptors.
For example assume that descriptor1 is colour, descriptor2 is size and descriptor3 is weight. For each object I would build a triplet which describes that object.
In my case I have thousands of entries for each descriptor. So I build a table for each descriptor. How can I now build a tripled and relate that to an object in the object table?
If there would be only one such tripled for each object, I could just store the three descriptor ids in each object as a foreign key, but let's assume that each object can have 0 or many such triplets.
I am using sqlalchemy, but I am happy to do the coding myself, I am just looking for keywords to look for in the documentation, since so far I could not find much.
My solution would be to create another table with three descriptor ids and and object id. Is that the way to go?
I could also store a string with tripled of descriptor ids in each object... but that seems to go very much against the principle of relational databases...
There's rarely a perfect design for all scenarios. What you've described would work well, if you know that you'll never need another attribute and you'll always lookup that row by using all three attributes. It depends on your use case, but those are pretty limiting assumptions.
Adding more attributes or looking up records by 1 or 2 attributes instead of all 3 is when Lucas's suggestion of adding additional columns that can be indexed is more flexible. The ability to define an arbitrary set of columns within a non-clustered index is where the relational db tends to get a lot of it's search performance/flexibility.
Related
In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column is doing would take a long time I'm going to try to simplify it by using a similar structured example based on a job database.
So say we have following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in person that are then linked to the more specific job tables using a 1 to 1 relation using the personId PK as the FK. In our use case a person can only ever have one job so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works we now have the use case where in our application we get the personId and want to get the data of the respective job table. This gets us to the problem that we can't immediately know what kind of job the person with this personId is doing.
A few options we came up with to solve this issue:
Deal with it in the backend
This means just leaving the architecture as it is and look for the right table in the backend code. This could mean looking through every table present and/or construct a semi-complicated join select in which we have to sift through all columns to find the ones which are filled.
All in all: Possible but means a lot of unecessary selects. We also would like to keep such database oriented logic in the actual database.
Using a Type Field
This means adding a field column in the Person table filled for example with numbers to determine the correct child table like:
So you could add a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduced the amount of backend logic we then have to make sure that the numbers we use in the Type field are known in the backend and don't ever change.
Use separate IDs for each table
That means every job gets its own ID (has to be nullable) in Person like:
Now it's easy to find out which job each person has due to the others having an empty ID.
So my question is: Which one of these designs is the best practice? Am i missing an obvious solution here?
Bill Karwin made a good explanation on a problem similar to this one. https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seem to come with the least drawbacks as described by the other commenters and posters. As there was no actual answer portraying the second option as a solution i will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type from looking at the parent table. As a result the backend would have to include all logic which includes scanning all tables for the that contains the id. While you can compress most of the logic into a single big Join select it would still be a lot more logic as opposed to the other options.
Against Option 3:
As #yuri-g said this one is technically not possible as the separate IDs could not setup as primary keys. They would have to be nullable and as a result can't be indexed, essentially rendering the parent table useless as one of the reasons for it was to have a unique personID across the tables.
Against a single table containing all columns:
For smaller use cases as the one i described in the question this might me viable but we are talking about a bunch of tables with each having roughly 2-6 columns. This would make this option turn into a column-mess really quickly.
Against a flat design with a key-value table:
Our properties have completly different data types, different constraints and foreign key relations. All of this would not be possible/difficult in this design.
Against custom database objects containt the child specific properties:
While this option that #Matthew McPeak suggested might be a viable option for a lot of people our database design never really used objects so introducing them to the mix would likely cause confusion more than it would help us.
In favor of the second option:
This option is easy to use in our table oriented database structure, makes it easy to distinguish the proper child table and does not need a lot of reworking to introduce. Especially since we already have something similar to a Type table that we can easily use for this purpose.
Third option, as you describe it, is impossible: no RDBMS (at least, of I personally know about) would allow you to use NULLs in PK (even composite).
Second is realistic.
And yes, first would take up to N queries to poll relatives in order to determine the actual type (where N is the number of types).
Although you won't escape with one query in second case either: there would always be two of them, because you cant JOIN unless you know what exactly you should be joining.
So basically there are flaws in your design, and you should consider other options there.
Like, denormalization: line non-shared attributes into the parent table anyway, then fields become nulls for non-correpondent types.
Or flexible, flat list of attribute-value pairs related through primary key (yes, schema enforcement is a trade-off).
Or switch to column-oriented DB: that's a case for it.
Currently scoping out a new system. Like many systems, it will be required to store documents and link them to other kinds of item. In this instance a Document object can belong to a Job or it can belong to an Item (which in turn belongs to a Job).
We could do this by having a JobId and an ItemId against a Document and leaving one or the other blank if necessary, but that's going to mean annoying conditional logic in the handling code. So, two link tables seems a better idea.
However, it is likely that we will need to link Documents to other items in the system at some point in the future. There are Company and User objects, for example, and we might want to record Documents against those. There may be more.
That would entail a proliferation of link tables which, while effective, is messy and hard to follow.
This solution is in SQL Server and will be handled in code via Entity Framework.
Are there any design principles that can allow us to hook up Document objects with a variety of other system objects as required in a neater and more flexible way?
You could store two values: the id, and the type of object to which the document is attached. It doesn't allow the use of foreign keys, but is compatible with many application development frameworks.
If you have the partitioning option then you could dedicate different partitions to different object types.
You could also have multiple tables, one for job documents, one for item documents, and get an overview of all of them with a view that UNION ALL's them together. If you need uniqueness in that result set then you could use UUIDs for the primary key, or add an extra column to the view to express from which table the row was read.
After-edit: Wow, this question go long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to your and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to
numeric identifiers by creating a primitives lookup table that stores
values such as Monthly, Semi-Annually, and Yearly. I then store the
IDs of these primitives in the table of record and use a view to join
that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string
'Monthly' stored in the table's ReviewPeriod column. The thought of
this happening has made me cringe since I've started working here, but
now I am starting to think that non-normalized data may be the best
option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method
while allowing it to be dynamic and support the constant adding of new
options to any dropdown list at any time is this: When saving the data
to the database, iterate through every single property of my business
object (.NET class in this case) and check for any string value that
exists in the primitives table. If it doesn't, add it and return the
auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.
I'm trying to figure out how to implement this relationship in coldfusion. Also if anyone knows the name for this kind of relationship I'd be curious to know it.
I'm trying to create the brown table.
Recreating the table from the values is not the problem, the problem that I've been stuck with for a couple of days now is how to create an editing environment.
I'm thinking that I should have a table with all the Tenants and TenantValues (TenantValues that match TenantID I'm editing) and have the empty values as well (the green table)
any other suggestions?
The name of this relationship is called an Entity Attribute Value model (EAV). In your case Tenant, TenantVariable, TenantValues are the entity, attribute and value tables, respectively. EAV is attempt to allow for the runtime definition or entities and is most found in my experience backing content managements systems. It has been referred to an as anti pattern database model because you lose certain RDBMS advantages, while gaining disadvantages such as having to lock several tables on delete or save. Often a suitable persistence alternative is a NoSQL solution such as Couch.
As for edits, the paradigm I typically see is deleting all the value records for a given ID and inserting inside a loop, and then updating the entity table record. Do this inside of a transaction to ensure consistency. The upshot of this approach is that it's must easier to figure out than delta detection algorithm. Another option is using the MERGE statement if your database supports it.
You may want to consider an RDF Triple Store for this problem. It's an alternative to Relational DBs that's particularly good for sparse categorical data. The data is represented as triples - directed graph edges consisting of a subject, an object, and the predicate that describes the property connecting them:
(subject) (predicate) (object)
Some example triples from your data set would look something like:
<Apple> rdf:type <Red_Fruit>
<Apple> hasWeight "1"^^xsd:integer
RDF triple stores provide the SPARQL query language to retrieve data from your store much like you would use SQL.
We are working on a mapping application that uses Google Maps API to display points on a map. All points are currently fetched from a MySQL database (holding some 5M + records). Currently all entities are stored in separate tables with attributes representing individual properties.
This presents following problems:
Every time there's a new property we have to make changes in the database, application code and the front-end. This is all fine but some properties have to be added for all entities so that's when it becomes a nightmare to go through 50+ different tables and add new properties.
There's no way to find all entities which share any given property e.g. no way to find all schools/colleges or universities that have a geography dept (without querying schools,uni's and colleges separately).
Removing a property is equally painful.
No standards for defining properties in individual tables. Same property can exist with different name or data type in another table.
No way to link or group points based on their properties (somehow related to point 2).
We are thinking to redesign the whole database but without DBA's help and lack of professional DB design experience we are really struggling.
Another problem we're facing with the new design is that there are lot of shared attributes/properties between entities.
For example:
An entity called "university" has 100+ attributes. Other entities (e.g. hospitals,banks,etc) share quite a few attributes with universities for example atm machines, parking, cafeteria etc etc.
We dont really want to have properties in separate table [and then linking them back to entities w/ foreign keys] as it will require us adding/removing manually. Also generalizing properties will results in groups containing 50+ attributes. Not all records (i.e. entities) require those properties.
So with keeping that in mind here's what we are thinking about the new design:
Have separate tables for each entity containing some basic info e.g. id,name,etc etc.
Have 2 tables attribute type and attribute to store properties information.
Link each entity (or a table if you like) to attribute using a many-to-many relation.
Store addresses in different table called addresses link entities via foreign keys.
We think this will allow us to be more flexible when adding, removing or querying on attributes.
This design, however, will result in increased number of joins when fetching data e.g.to display all "attributes" for a given university we might have a query with 20+ joins to fetch all related attributes in a single row.
We desperately need to know some opinions or possible flaws in this design approach.
Thanks for your time.
In trying to generalize your question without more specific examples, it's hard to truly critique your approach. If you'd like some more in depth analysis, try whipping up an ER diagram.
If your data model is changing so much that you're constantly adding/removing properties and many of these properties overlap, you might be better off using EAV.
Otherwise, if you want to maintain a relational approach but are finding a lot of overlap with properties, you can analyze the entities and look for abstractions that link to them.
Ex) My Db has Puppies, Kittens, and Walruses all with a hasFur and furColor attribute. Remove those attributes from the 3 tables and create a FurryAnimal table that links to each of those 3.
Of course, the simplest answer is to not touch the data model. Instead, create Views on the underlying tables that you can use to address (5), (4) and (2)
1 cannot be an issue. There is one place where your objects are defined. Everything else is generated/derived from that. Just refactor your code until this is the case.
2 is solved by having a metamodel, where you describe which properties are where. This is probably needed for 1 too.
You might want to totally avoid the problem by programming this in Smalltalk with Seaside on a Gemstone object oriented database. Then you can just have objects with collections and don't need so many joins.