What's a good approach to data warehouse design if requested reports require summarized information about the same dimensions (and at the same granularity) but the underlying data is stored in separate fact tables?
For example, a report showing total salary paid and total expenses reported for each employee for each year, when salary and expenses are recorded in different fact tables. Or a report listing total sales per month and inventory received per month for each SKU sold by a company, when sales comes from one fact table and receiving comes from another.
Solving this problem naively seems pretty easy: simply query and aggregate both fact tables in parallel, then stitch together the aggregated results either in the data warehouse or in the client app.
But I'm also interested in other ways to think about this problem. How have others solved it? I'm wondering both about data-warehouse schema and design, as well as making that design friendly for client tools to build reports like the examples above.
Also, does this "dimension sandwich" use-case have a name in canonical data-warehousing terminology? If yes that will make it easier to research via Google.
We're working with SQL Server, but the questions I have at this point are hopefully platform-neutral.
I learned today that this technique is called Drilling Across:
Drilling across simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes. The answer sets from the two queries are aligned by performing a sort-merge operation on the common dimension attribute row headers. BI tool vendors refer to this functionality by various names, including stitch and multipass query.
Sounds like the naive solution above (query multiple fact tables in parallel and stitch together the results) is also the suggested solution.
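To make that concrete, here is a minimal sketch of a drill-across query for the salary/expenses example, assuming hypothetical FactSalary and FactExpense tables that share a conformed DimDate dimension and employee key (all names are made up):

-- Aggregate each fact table separately at the same grain (employee + year),
-- then stitch the two result sets together on the conformed attributes.
WITH salary AS (
    SELECT d.CalendarYear, f.EmployeeKey, SUM(f.SalaryAmount) AS TotalSalary
    FROM FactSalary f
    JOIN DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear, f.EmployeeKey
),
expenses AS (
    SELECT d.CalendarYear, f.EmployeeKey, SUM(f.ExpenseAmount) AS TotalExpenses
    FROM FactExpense f
    JOIN DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear, f.EmployeeKey
)
SELECT COALESCE(s.CalendarYear, e.CalendarYear) AS CalendarYear,
       COALESCE(s.EmployeeKey, e.EmployeeKey)   AS EmployeeKey,
       s.TotalSalary,
       e.TotalExpenses
FROM salary s
FULL OUTER JOIN expenses e
    ON e.CalendarYear = s.CalendarYear
   AND e.EmployeeKey  = s.EmployeeKey;

The FULL OUTER JOIN is the "stitch" step; it keeps employee/year combinations that appear in only one of the two facts.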
More info:
Drilling Across - Kimball overview article
http://blog.oaktonsoftware.com/2011/12/three-ways-to-drill-across.html - SQL implementation suggestions for drilling across
Many thanks to #MarekGrzenkowicz for pointing me in the right direction to find my own answer! I'm answering it here in case someone else is looking for the same thing.
The "naive solution" you described is most of the time the preferred one.
A common exception is when you need to filter the detailed rows of one fact using another fact table. For example, "show me the capital tie-up (stock inventory) for the articles we have not sold this year". You cannot simply sum up the capital tie-up in one query. In this case a consolidated fact table can be a solution, if you are able to express both measures at a common grain.
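For illustration only, a sketch of that exception with hypothetical FactInventory, FactSales and DimDate tables (names and columns are assumptions):

-- The inventory rows must be filtered by the sales fact *before* aggregation,
-- so two independently aggregated queries cannot simply be stitched together.
SELECT i.ArticleKey, SUM(i.StockValue) AS CapitalTieUp
FROM FactInventory i
WHERE NOT EXISTS (
          SELECT 1
          FROM FactSales s
          JOIN DimDate d ON d.DateKey = s.DateKey
          WHERE s.ArticleKey = i.ArticleKey
            AND d.CalendarYear = YEAR(GETDATE())
      )
GROUP BY i.ArticleKey;
-- (In practice you would also restrict FactInventory to the latest snapshot date.)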
Good Afternoon,
I am new to OLAP (and databases in general). I need to write a query to retrieve the TOP 10 sales for a year by product.
To do so I would have to work with the following tables (simplified just to show the main structure):
LOCATION(location_id,country,....,city,....,district_id),
SALES_A(shop_id, product_id,....., unit_sales,....., unit_price),
SALES_B(shop_id, product_id,...., unit_sales, unit_price),
SHOP(shop_id,....,location_id,.....)
The structure of the query I need to write using RANK() should be clear from the examples I have seen, but my main doubt comes from a comment made in this video https://www.youtube.com/watch?v=pmpzsws4xwA&t=12s about using analytical functions to avoid self-joins.
Since all the examples I have seen so far use only a single table, and because of the comment made in the linked video, my question is: within the context of a data warehouse, is it OK to do joins between the different tables needed, and then apply analytical functions to the resulting table?
Does this incur a performance penalty, and should it be done in a different way?
Many thanks in advance
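For what it is worth, a sketch of how that could look against the tables above (the extra names are guesses, and the year filter is left out because the simplified schema does not show a date column): join and aggregate first, then apply the analytical function to the result, e.g. top 10 products by revenue per country.

-- Combine the two sales fact tables, join to the dimension tables, aggregate,
-- and only then apply RANK() to the joined/aggregated result.
WITH all_sales AS (
    SELECT shop_id, product_id, unit_sales, unit_price FROM SALES_A
    UNION ALL
    SELECT shop_id, product_id, unit_sales, unit_price FROM SALES_B
),
sales_by_product AS (
    SELECT l.country, s.product_id,
           SUM(s.unit_sales * s.unit_price) AS revenue
    FROM all_sales s
    JOIN SHOP sh    ON sh.shop_id = s.shop_id
    JOIN LOCATION l ON l.location_id = sh.location_id
    GROUP BY l.country, s.product_id
),
ranked AS (
    SELECT country, product_id, revenue,
           RANK() OVER (PARTITION BY country ORDER BY revenue DESC) AS sales_rank
    FROM sales_by_product
)
SELECT country, product_id, revenue, sales_rank
FROM ranked
WHERE sales_rank <= 10;

Joining to dimension tables before applying the analytic function is a normal pattern in a star schema; the usual point about analytic functions avoiding self-joins refers to joining a table back to itself (e.g. to compare a row with its group total), not to joins against dimension tables.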
Can someone explain the terms drilling down and drilling across to me? I need to illustrate them both using ROLAP query code or pseudocode. I have tried researching them, but I find it complicated to understand; if someone could explain or point me in the right direction I would be grateful!
Drill down and drill across are terms generally used in the reporting tool. The idea is that the report may show your fact table summarized, and you can drill down for more details. For example, the fact table holds salary costs at employee level, but the report is by default summed up to company level. In the report you can then drill down, i.e. ask the reporting tool to show you the records it used to sum up to the row you are looking at. This could mean that you double-click on the company and are shown the salary cost per department, then double-click on a department and are shown the salary cost per employee.
Drill across is a feature where the reporting tool lets you get data from another fact table, e.g. when you are showing the salary cost for an employee you could drill across to another fact table holding other information about that employee.
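As a rough ROLAP-style illustration of drilling down (table and column names are invented), each drill step just adds a finer-grained dimension attribute to the GROUP BY:

-- Company level
SELECT SUM(f.SalaryCost) AS TotalSalaryCost
FROM FactSalary f;

-- Drill down to department level
SELECT e.Department, SUM(f.SalaryCost) AS TotalSalaryCost
FROM FactSalary f
JOIN DimEmployee e ON e.EmployeeKey = f.EmployeeKey
GROUP BY e.Department;

-- Drill down again to employee level
SELECT e.Department, e.EmployeeName, SUM(f.SalaryCost) AS TotalSalaryCost
FROM FactSalary f
JOIN DimEmployee e ON e.EmployeeKey = f.EmployeeKey
GROUP BY e.Department, e.EmployeeName;

Drilling across would instead run a second aggregate query against another fact table at the same grain and stitch the two result sets together on the shared dimension attributes.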
There is plenty of variation in how the tools implement these features and what they mean by them, but the general principles can be found here, as explained by Kimball:
Drilling Down, Up, and Across
I'm an Excel user trying to solve this one problem, and the only efficient way I can think of is to do it with a database. I use arrays in programming VBA/Python and I've queried databases before, but never really designed a database. So I'm here to look for suggestions on how to structure this DB in Access.
Anyway, I currently maintain a sheet of ~50 economic indicators for ~100 countries. It's a very straightforward sheet, with
Column headers: GDP , Unemployment , Interest Rate, ... ... ... Population
And Rows:
Argentina
Australia
...
...
Yemen
Zambia
etc.
I want to take snapshots of this sheet so I can see trends and run some analysis in the future. I thought of just duplicating the worksheet in Excel each time, but that feels inefficient.
I've never designed a database before. My question would be what's the most efficient way to store these data for chronological snapshots? In the future I will probably do these things:
Pull up the snapshot for a day mm-dd-yy in the past.
Take two different data points of a metric, for however many countries, and track the change/rate of change, etc.
Once I can query them well enough I'll probably do some kind of statistical analysis, which just requires getting the right data set.
I feel like I need to create an individual table for each country and add a row to every country table every time I take a snapshot. I'll try to play with VBA to automate this.
I can't think of any other way to do this with fewer tables. What would you suggest? Is it recommended practice to use more than a dozen tables for this task?
There are a couple of ways of doing this:
Option 1
I'd suggest you probably only need a single table, something akin to:
Country, date_of_snapshot, columns 1-50 (GDP, etc.)
Effectively, you would add a new row for each day and each country.
Option 2
You could also use a table structured as below, though this would require more complex queries, which may be too much for Access:
Country, date_of_snapshot, factor, value
with each factor (GDP, etc.) getting a row for each date and country.
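To make the two options concrete, here is a rough sketch of the table definitions in generic SQL (column names are only examples; Access's data types and table designer will differ slightly):

-- Option 1: one wide row per country per snapshot date
CREATE TABLE CountrySnapshot (
    Country        VARCHAR(100) NOT NULL,
    DateOfSnapshot DATE         NOT NULL,
    GDP            DECIMAL(18,2),
    Unemployment   DECIMAL(9,4),
    InterestRate   DECIMAL(9,4),
    -- ... one column per indicator, through Population
    PRIMARY KEY (Country, DateOfSnapshot)
);

-- Option 2: one narrow row per country, snapshot date and indicator
CREATE TABLE CountryIndicator (
    Country        VARCHAR(100) NOT NULL,
    DateOfSnapshot DATE         NOT NULL,
    Factor         VARCHAR(50)  NOT NULL,  -- 'GDP', 'Unemployment', ...
    IndicatorValue DECIMAL(18,4),
    PRIMARY KEY (Country, DateOfSnapshot, Factor)
);

Retrieving a snapshot for a given day is then a simple WHERE DateOfSnapshot = ... against Option 1, while Option 2 needs a crosstab/pivot query to get back to the original spreadsheet layout.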
I am reading about the "entity attribute value model", which sort of reminds me of a star schema, which you use in data warehousing.
One table has all the facts (even if you mix apples and bananas, e.g. date of farming, weight, price, color, type, name) and a bunch of tables holding the details (e.g. infected_with_banana_virus_type, apple_specific_acid_level).
This is done in both approaches, so I can't see a difference between these two terms?
Please enlighten me. CHEERS
In all approaches you have entities, attributes and values. Everything reduces to this logically. Since everything has entities, attributes and values, you can always claim that everything is the same. All data structures are -- from that point of view -- identical.
Please draw a diagram of a star schema. With a fact (say web site GET requests) and some dimensions like Time, IP Address, Requested Resource Path, and session User.
Actually draw the actual diagram, please. Don't read the words, look at the picture of five tables.
After drawing that picture, draw a single EAV table.
Actually draw the picture with entity, attribute and value columns. Don't read the words. Look at the picture of one table.
Okay?
Now write down all the differences between the two pictures. Number of tables. Number of columns. Data types of each column. All the differences.
We're not done.
Write a SQL query to count GET requests by day of the week for a given user using the star schema. Actually write the SQL. It's a three-table join, with a GROUP BY and a WHERE.
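For example, with invented table and column names it might look like this:

-- Star schema: one fact table joined to two small dimension tables.
SELECT t.day_of_week, COUNT(*) AS get_requests
FROM fact_request f
JOIN dim_time t ON t.time_id = f.time_id
JOIN dim_user u ON u.user_id = f.user_id
WHERE u.username = 'some_user'
  AND f.http_method = 'GET'
GROUP BY t.day_of_week;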
Try and write a SQL query to count GET requests by day of week for the EAV table.
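A sketch of the EAV equivalent (again with invented names, assuming the day of week and HTTP method are stored as just more attribute rows) needs a self-join per attribute before it can even start aggregating:

-- EAV: every attribute you need becomes another self-join on the same table.
SELECT dow.attr_value AS day_of_week, COUNT(*) AS get_requests
FROM eav req_user
JOIN eav dow    ON dow.entity_id    = req_user.entity_id
               AND dow.attribute    = 'day_of_week'
JOIN eav method ON method.entity_id = req_user.entity_id
               AND method.attribute = 'http_method'
WHERE req_user.attribute  = 'username'
  AND req_user.attr_value = 'some_user'
  AND method.attr_value   = 'GET'
GROUP BY dow.attr_value;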
Okay?
Now write down all the differences between the two queries. Complexity of the SQL, for example. Performance of the SQL. Time required to write the SQL.
Now you know the differences.
We have a SQL Server DB design-time scenario: we have to store data about different organizations in our database (e.g. Customer, Vendor, Distributor, ...). All the different organizations share (almost) the same type of information, like address details, etc. And they will be referenced in other tables (i.e. linked via OrgId, and we have to look up OrgName in many different places).
I see two options:
We create a table for each organization type, like OrgCustomer, OrgDistributor, OrgVendor, etc. All the tables will have a similar structure, and some tables will have extra special fields; e.g. the customer has a field HomeAddress (which the other Org tables don't have), and vice versa.
We create a common OrgMaster table and store ALL the different Orgs in a single place. The table will have an OrgType field to distinguish among the different types of Orgs, and the special fields will be appended to the OrgMaster table (only the relevant Org records will have values in such fields; in other cases they will be NULL).
Some Pros & Cons of #1:
PROS:
It helps distribute the load when accessing different types of Org data, so I believe this improves performance.
Provides full scope for customizing any particular Org table without affecting the other existing Org types.
Not sure if different indexes on different/distributed tables work better than a single big table.
CONS:
Replication of design. If I have to increase the size of the ZipCode field, I have to do it in ALL the tables.
Replication of the manipulation implementation (i.e. we've used stored procedures for CRUD operations, so the replication goes n-fold: 3-4 insert SPs, 2-3 SELECT SPs, etc.).
Everything grows n-fold, right from DB constraints/indexing to SPs to the business objects in the application code.
A (common) change in one place has to be made in all the other places as well.
Some Pros & Cons of #2:
PROS:
N-fold becomes 1-fold :-)
Maintenance gets easier because we can try to implement single entry points for all the operations (e.g. a single SP to handle CRUD operations, etc.).
We only have to worry about maintaining a single table; indexing and other optimizations are limited to a single table.
CONS:
Does it create a bottleneck? Can it be managed by implementing views and other optimized data access strategies?
The other side of a centralized implementation is that a single change has to be tested and verified in ALL the places. It isn't abstract.
The design might seem a little less 'organized/structured', especially due to those few Orgs for which we need to add 'special' fields (which are irrelevant to the other types).
I also have in mind an Option #3: keep the Org tables separate but create a common OrgAddress table to store the common fields. But this puts me in the middle of #1 and #2, and it creates even more confusion!
To be honest, I'm an experienced programmer but not an equally experienced DBA, because that's not my mainstream job, so please help me work out the right tradeoff between parameters like design complexity and performance.
Thanks in advance. Feel free to ask about any technical details; suggestions are welcome.
Hemant
I would say that your 2nd option is close, just a few points:
Customer, Distributor, Vendor are TYPES of organizations, so I would suggest:
Table [Organization] which has all columns common to all organizations and a primary key for the row.
Separate tables [Vendor], [Customer], [Distributor] with specific columns for each one and FK to the [Organization] row PK.
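A minimal T-SQL sketch of that layout (column names are only illustrative):

CREATE TABLE Organization (
    OrgId   INT IDENTITY(1,1) PRIMARY KEY,
    OrgType VARCHAR(20)  NOT NULL,       -- 'Customer', 'Vendor', 'Distributor'
    OrgName VARCHAR(200) NOT NULL,
    Address VARCHAR(500)
    -- ...other columns common to every organization
);

CREATE TABLE Customer (
    OrgId       INT PRIMARY KEY REFERENCES Organization (OrgId),
    HomeAddress VARCHAR(500)
    -- ...other customer-only columns
);

CREATE TABLE Vendor (
    OrgId INT PRIMARY KEY REFERENCES Organization (OrgId)
    -- ...vendor-only columns (Distributor would follow the same pattern)
);

Lookups of OrgName from other tables then always go through Organization.OrgId, regardless of the organization type.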
This sounds like a "supertype/subtype relationship".
I have worked on various applications that have implemented all of your options. To be honest, you probably need to take account of the way that your users work with the data, how many records you are expecting, commonality (same organisation having multiple functions), and what level of updating of the records you are expecting.
Option 1 worked well in an app where there was very little commonality. I have used what is effectively your option 3 in an app where there was more commonality, and didn't like it very much (there is more work involved in getting the data from different layers all of the time). A rewrite of this app is implementing your option 2 because of this.
HTH