I am struggling in figuring out how to create a star schema from multiple source tables. I work at a trading firm so the data is related to user trading activity. The issue I am having is that our datasets do not have primary ids for every field that could be a dimension. Instead, we usually relate our data together using the combination of date and account number. Here is an example of 3 source tables...
I would like to turn this into a star schema, something that looks like ...
Is my only option to denormalize my source tables into one wide table (joining trades to position on account number and date, and joining the users table on account number), create keys for each dimension, then re normalizing it into the star schema? Are star schema's ever built from multiple source tables?
Star schemas are almost always created from multiple source tables.
The normal process is:
Populate your dimension tables
Create a temporary/virtual fact record using your source data
Using this fact record, look up the relevant dimension keys
Write the actual fact record to your target fact table
Data-warehousing is about query speed. The data-warehouse should not be concerned with data integrity. IT SHOULD NOT CLEAN OR CORRECT BAD DATA. It only needs to gather all the data together into a single record to present to the model for analysis. Denormalizing the data is how this is done.
In a star schema, dimensions do not know about each other and have no relationships with other dimensions. In a snowflake, dimensions are related to other dimensions. That is the primary difference between star and snowflake.
All the metadata options for events are rolled up into dimensions and used for slicing/filtering. All the measurable/calculation data for an event are in the event fact, along with a reference to the dimension(s) containing the relevant metadata. The Metadata/Dimension is reused across multiple fact records.
Based on the limited example you've provided, I'd suggest you research degenerate dimensions and junk dimensions. Your Trade and Position data may need to be turned into a fact and a dimension (degenerate), and some of your flag attributes may be best placed into a junk dimension.
You should also make sure your dimension keys are clear. You should not have multiple paths to a dimension (accountnumber: trade -> position -> user & trade -> user ) as that will cause inconsistent results when querying depending on which relationship you traverse.
I have a model with three fact tables and three dimensions. Each fact table can individually relate to each dimension, this works fine. But the three dimensions are in this schema not related and therefore cannot be used with each other by the client.
I have tried many solutions, one of them was merging the DimPerson, DimDepartment and DimDistrict by Crossjoin to get "all possible combinations". But given the number of rows in each of these dimensions, the task takes too long.
Any ideas? Or am I going about this the wrong way?
Here is the schema:
We have lots of models like this. The most common shared dimension is Date. We filter Date on every dashboard using the dimension and all the Fact tables return the filtered values. Be sure to use the Dimension columns for filtering, not the Fact table columns. Best practice is to hide the Fact columns (PersonID, DeptID, DistrictID) so these columns are only available via the Dimensions.
I'm trying to decide which way to go if I have an app that needs to be able to change the db schema based on the user input.
For example, if I have a "car" object that contains car properties, like year, model, # of doors etc, how do I store it in the DB in such a way, that the user should be able to add new properties?
I read about EAV tables and they seem right for this thing, but the problem is that queries will get pretty complicated when I try to get a list of cars filtered by a set of properties.
Could I generate the tables dynamically instead? I see that Sqlite has support for ADD COLUMN, but how fast is it when the table reaches many records? And it looks like there's no way to remove a column. I have to create a new table without the column I want to remove, and copy the data from the old table. That's certainly slow on large tables :(
I will assume that SQLite (or another relational DBMS) is a requirement.
EAVs
I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.
Lets say that you design a datamodel with three tables: entities, attributes, and _entities_attributes_:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attributes
(attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE entity_attributes
(entity_id INTEGER, attribute_id INTEGER, value TEXT,
PRIMARY KEY(entity_id, attribute_id));
In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate to your cars (brand, model, color, ...) and its type (text, number, date, ...), and the _entity_attributes_ will hold the values of the attributes for a given entity (for example "red").
Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).
INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and its relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations.
When you want to perform an UPDATE, you will need to know what is the entity that you want to update, and update the desired attribute joining with the relation between the entity and its attributes.
When you want to perform a DELETE, you will also need to need to know what is the entity you want to delete, delete its attributes, delete the relation between your entity and its attributes and then delete the entity.
But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.
Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be
SELECT brand, model FROM cars;
If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3 table join. The query will perform really bad.
Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?
Indexes are another thing that you could not benefit of (or at least not as much as it would be desirable), and they are very neccesary as the data stored grows.
As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.
The only advantage of using EAVs is that you can store virtually everything with the same model, but is like having a box full of stuff where you want to find one concrete, small item.
Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:
http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.html
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
Dynamic columns in database tables
On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example create a car table with the basic attributes that you know that you will use and then add columns dynamically when you need them (for example the number of exhausts).
The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.
This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11
BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;
Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to ensure that any of your cars have this property to prevent from losing data before deleting the column. But this, of course depends on your concrete scenario.
Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).
Another option (pseudo-generic model)
A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.
Lets say that you create a table like this:
CREATE TABLE entities
(entity_id INTEGER PRIMARY KEY,
name TEXT,
type TEXT,
attribute1 TEXT,
attribute1 TEXT,
...
attributeN TEXT
);
In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).
If you need to know what the attribute37 stands for when type is "red", you would need to add another table that relates the types and attributes with the description of the attributes.
And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).
In this case, the attributes are always stored as TEXT (as in EAVs) with it's disadvantages.
But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.
Hope it helps.
Follow up from the comments:
With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.
As you say, most of your columns will be empty for most of your records, and you will need to index all of your colums for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.
If you implement a EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:
SELECT e.name, a1.value brand, a2.value model, a3.value color
FROM entities e
LEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')
LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')
LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');
As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:
SELECT name, attribute1 brand, attribute7 model, attribute35 color
FROM entities;
Also, take into account the potential size of your _entity_attributes_ table. If you can potentially have 40 attributes for each entity, lets say that you have 20 not null for each of them. If you have 10,000 entities, your _entity_attributes_ table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.
It all depends on the way in which your application needs to reason about the data.
If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.
For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.
However, you could achieve something like this using XML, XPath etc.
If your application has a set on data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian Jira's bug tracking solution has JQL, a SQL-like language for querying bugs.
EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.
For your use case, a document oriented database like MongoDB would do great.
Another option that I haven't seen mentioned above is to use denormalized tables for the extended attributes. This is a combination of the pseudo-generic model and the dynamic columns in database tables. Instead of adding columns to existing tables, you add columns or groups of columns into new tables with FK indexes to the source table. Of course, you'll want a good naming convention (car, car_attributes_door, car_attributes_littleOldLadies)
Your selection problem becomes that of applying a LEFT OUTER JOIN to include the extended attributes that you want to include.
Slower than normalized, but not as slow as EAV.
Adding new extended attributes becomes a problem of adding a new table.
Harder than EAV, easier/faster than modifying table schema.
Deleting attributes becomes a problem of dropping whole tables.
Easier/faster than modifying table schema.
These new attributes can be strongly typed.
As good as modifying table schema, faster than EAV or generic columns.
The biggest advantage to this approach that I can see is that deleting unused attributes is quite easy compared to any of the others via a single DROP TABLE command. You also have the option to later normalize often-used attributes into larger groups or into the main table using a single ALTER TABLE process rather than one for each new column you were adding as you added them, which helps with the slow LEFT OUTER JOIN queries.
The biggest disadvantage is that you're cluttering up your table list, which admittedly is often not a trivial concern. That and I'm not sure how much better LEFT OUTER JOIN's actually perform than EAV table joins. It's definitely closer to EAV join performance than normalized table performance.
If you're doing a lot of comparisons/filters of values that benefit greatly from strongly typed columns, but you add/remove these columns frequently enough to make modifying a huge normalized table intractable, this seems like a good compromise.
I would try EAV.
Adding columns based on user input doesn't sounds nice to me and you can quickly run out of capacity. Queries on very flat table can also be a problem. Do you want to create hundreds of indexes?
Instead of writing every thing to one table, I would store as many as possible common properties (price, name , color, ...) in the main table and those less common properties in an "extra" attributes table. You can always balance them later with a little effort.
EAV can performance well for small to middle sized data set. Since you want to use SQLlite, I guess it's not be a problem.
You may also want to avoid "over" normalizing your data. With the cheap storage
we currently have, you can use one table to store all "Extra" attributes, instead of two:
ent_id, ent_name, ...
ent_id, attr_name, attr_type, attr_value ...
People against EAV will say its performance is poor on large database. It's sure that it won't performance as well as normalized structure but you don't want to change structure on a 3TB table either.
I have a low quality answer, but possible, that came from HTML tags that are like : <tag width="10px" height="10px" ... />
In this dirty way you will have just one column as a varchar(max) for all properties say it Props column and you will store data in it like this:
Props
------------------------------------------------------------
Model:Model of car1|Year:2010|# of doors:4
Model:Model of car2|NewProp1:NewValue1|NewProp2:NewValue2
In this way all works will go to the programming code in business layer with using some functions like concatCustom that get an array and return a string and unconcatCustom that get a string and return an array.
For more validity of special characters like ':' and '|', I suggest '#:#' and '#|#' or something more rare for splitter part.
In a similar way you can use a text or binary field and store an XML data in the column.
I'm developing an Asset management application.
Looking through the excel tracker that was being used previously, I was able to identify some attributes that were common to all categories of assets (basically non-technical attributes such as Purchase Order No. , Warranty Info etc.) for which I think I will make a separate table.
But when storing technical-attributes, there are many categories of assets for which I need only one or two additional attributes to be stored.
Should a make a single table for all these attributes and store NULLs wherever applicable or should I make a separate table each category containing just the asset ID and the addition columns? Which approach is better/more pragmatic?
Is cluttering the database with too many tables ok? I have around 10 such categories.
There are 3 known approaches to this:
Single table
In this model, you have a single table with all known columns, and allow them to be null for types that don't have that attribute. This gives you a simple database, and fairly simple SQL, but doesn't allow support for common features that relational databases give you, like insisting on non-null columns for a data type, or creating unique indices where that makes sense.
It also tends to lead to messy SQL, with developers forgetting over time what columns mean, so you could get a column being used for multiple purposes.
It does make it easy to join to other tables - so if you have an asset and a purchase related to that asset, the "purchase" table joins to the "asset" table on "assetID".
Table per subtype
In this case, you build a table for each subtype, and enforce the data characteristics of that subtype with not null, unique etc.
This creates a clearer separation of subtypes, and is less likely to degrade into big ball of mud, but makes joins very hard - to join from "purchase" to "asset", you have to know which table holds that particular asset.
Common table for common fields, table per subtype
In this model, you have a single table for the fields that are common between subtypes - you say you've identified this already - and have further tables for each subtype to store the unique attributes.
This solves the joining problem between "asset" and "purchase", keeps the data pretty self-describing.
It does mean client logic needs to implement the "join asset_master to asset_subtype" issue.
I prefer option 3 - it's the best trade-off between maintainability and managability.
Databases should be able to handle lots of columns and lots of tables, so both approaches should work from that perspective.
If you don't have any additional requirements, I'd use the single table approach. It is the easiest, and the only thing you are loosing is the ability to put not null constraints on the fields that exist only form some categories
I have a table that has a bunch of fields. The fields can be broken into logical groups - like a job's project manager info. The groupings themselves aren't really entity candidates as they don't and shouldn't have their own PKs.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Is there anything I should watch out for when I do this? Is this just a poor choice?
I can see that maybe my queries will get more complicated with all the extra joins but that can be mitigated with views right? If we're talking about a table with less than 100k records is this going to have a noticeable effect on performance?
Edit: I'll justify the non-entity candidate thoughts a little further. This information is entered by our user base. They don't know/care about each other. So its possible that the same user will submit the same "projectManager name" or whatever which, at this point, wouldn't be violating any constraint. Its for us to determine later on down the pipeline if we wanna correlate entries from separate users. If I were to give these things their own key they would grow at the same rate the main table grows - since they are essentially part of the same entity. At no pt is a user picking from a list of available "project managers".
So, given the above, I don't think they are entities. But maybe not - if you have further thoughts please post.
I don't usually use 1 to 1 relations unless there is a specific performance reason for it. For example storing an infrequently used large text or BLOB type field in a separate table.
I would suspect that there is something else going on here though. In the example you give - PmFirstName - it seems like maybe there should be a single pm_id relating to a "ProjectManagers" or "Employees" table. Are you sure none of those groupings are really entity candidates?
To me, they smell unless for some rows or queries you won't be interested in the extra columns. e.g. if for a large portion of your queries you are not selecting the PmFirstName columns, or if for a large subset of rows those columns are NULL.
I like the smells tag.
I use 1 to 1 relationships for inheritance-like constructs.
For example, all bonds have some basic information like CUSIP, Coupon, DatedDate, and MaturityDate. This all goes in the main table.
Now each type of bond (Treasury, Corporate, Muni, Agency, etc.) also has its own set of columns unique to it.
In the past we would just have one incredibly wide table with all that information. Now we break out the type-specific info into separate tables, which gives us much better performance.
For now, to group them, the fields have prefixes (PmFirstName for example) but I'm considering breaking them out into multiple tables with 1:1 relations on the main table.
Create a person table, every database needs this. Then in your project table have a column called PMKey which points to the person table.
Why do you feel that the group of fields are not an entity candidates? If they are not then why try to identify them with a prefix?
Either drop the prefixes or extract them into their own table.
It is valuable splitting them up into separate tables if they are separate logical entities that could be used elsewhere.
So a "Project Manager" could be 1:1 with all the projects currently, but it makes sense that later you might want to be able to have a Project Manager have more than one project.
So having the extra table is good.
If you have a PrimaryFirstName,PrimaryLastName,PrimaryPhone, SecondaryFirstName,SecondaryLastName,SEcondaryPhone
You could just have a "Person" table with FirstName, LastName, Phone
Then your original Table only needs "PrimaryId" and "SecondaryId" columns to replace the 6 columns you previously had.
Also, using SQL you can split up filegroups and tables across physical locations.
So you could have a POST table, and a COMMENT Table, that have a 1:1 relationship, but the COMMENT table is located on a different filegroup, and on a different physical drive with more memory.
1:1 does not always smell. Unless it has no purpose.