I'm not sure if I can explain this well. I have two tables: one with the columns TotalLoan, InterestRate, Classification, and LendingRates, and the other with the columns SumLoan, Classification, and LendingRates.
The first table, let's call it Loans, holds the loans of specific users: their loan totals plus their interest rates. The other table holds the sum of the loan totals grouped by Classification and LendingRates. This means that if two users both have the Classification Individual and the LendingRate 1-5 years, their loans are summed together. Let me try to visualize them; here is the Loans table,
and here is the second table, containing the summed values grouped by lending rate and classification.
For simplicity, and because I'm dealing with real bank values here that I'm not supposed to share online, I created a fake summary in Excel; this is not the real data. The real data contains thousands of rows. So the formula is
(InterestRate * TotalLoan) / SumLoan.
So for Individuals with Overdraft, you'll take the first expression to be
(12*34555)/12221222
then
(14*22322)/12221222
then
(6*76772)/76772 and so on...
Does anyone have an idea how I can do this in Microsoft SQL? I'm seriously stumped here.
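As a minimal T-SQL sketch of one way to do this - the table and column names Loans and LoanSums are assumptions based on the description above, so adjust them to the real schema - join the detail table to the summary table on the two grouping columns and divide:

-- Assumed schema: Loans(TotalLoan, InterestRate, Classification, LendingRates)
--                 LoanSums(SumLoan, Classification, LendingRates)
SELECT
    l.Classification,
    l.LendingRates,
    l.TotalLoan,
    l.InterestRate,
    -- the CAST avoids integer division if the columns are whole numbers
    CAST(l.InterestRate * l.TotalLoan AS decimal(18, 6)) / s.SumLoan AS WeightedValue
FROM Loans AS l
JOIN LoanSums AS s
    ON s.Classification = l.Classification
    AND s.LendingRates = l.LendingRates;

If the summary table is itself derived from Loans, a window function can replace the join entirely: SUM(TotalLoan) OVER (PARTITION BY Classification, LendingRates) computes the denominator in one pass.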
I am new to Power BI and database management, and I want to clarify for myself how Power BI works with reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought was a good thing to do in data modelling.
I want to aggregate the value column of end_table over the grouping column Year. I am surprised that, to my understanding, Power BI sums up the value column within end_table itself, when I would expect the aggregation over the grouping variable in the connected table.
My basic example is based on this data and data model (you need to adjust the relationship manually):
# main_table: firm-year rows (Jahre = German for "years"); FK_id links to end_table
main_table <- data.frame(id = 1:20, FK_id = sample(1:2, 20, replace = TRUE), Jahre = 2016:2020)
main_table <- rbind(main_table, data.frame(id = 21:25, FK_id = sample(2:3, 5, replace = TRUE), Jahre = 2015))
# end_table: the "1" side of the 1:N relationship
end_table <- data.frame(id = 1:3, value = c(10, 20, 30))
The first 5 rows of the data, including all columns, look like this:
If I take out all row-specific information and sum over value, it will always show the sum of the end table, which is 60, in each Year.
Making the relationship bi-directional does not help; it just sums up the existing values of end_table in each year. I get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value]).
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to the main table using RELATED(), when it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows of a visual, and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need to write a measure that iterates over main_table (where Year lives) and looks up the related end_table value for each of its rows:
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )
I just started in Data Warehouse modeling and I need help modeling a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files linked together, one file 'order' and the other 'services'.
The 'order' file sets out a summary of each flight (orderId, departure date, arrival date, city of departure, city of arrival, total amount collected, etc.).
The 'services' file lists the services provided per flight (orderId, service name, quantity, amount/qty, etc.).
There is a 1:N relationship (order-services): each order has N services.
I already see some dimensions (Time, Location, etc.). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I thought about it: the star and snowflake schemas do not work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. But I'm stuck: should I make the order table a dimension rather than a fact table, or should I instead make the services table a dimension? They both look like fact tables to me. I get a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have multiple fact tables that are connected - see the discussion here.
So the first draft will simply follow your two fact tables with their natively provided dimensions.
Order is a fact table in one context and, in another context, a dimension table for the services table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the services table, so that each service row also has the departure date, arrival date, etc. dimensions defined.
This is done at load time in the ETL job, as sketched below.
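A minimal sketch of that ETL step, with hypothetical staging and target tables (stg_order, stg_service, and fact_service are illustrative names, not your real schema):

-- Copy the order-level dimensions onto each service row at load time.
INSERT INTO fact_service (order_id, service_name, quantity, amount,
                          departure_date, arrival_date, departure_city, arrival_city)
SELECT s.order_id, s.service_name, s.quantity, s.amount,
       o.departure_date, o.arrival_date, o.departure_city, o.arrival_city
FROM stg_service AS s
JOIN stg_order AS o
    ON o.order_id = s.order_id;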
I would be somewhat careful about denormalizing the measures from order to service - that would basically eliminate the whole order table.
There will be no problem with the measure total amount collected if it is a redundant sum of the service amounts - you may safely get rid of it.
But you will surely need the number of flights or the number of people transported - those measures are better defined in the order fact table; you cannot simply replicate them in the N rows for each service without counting them N times.
A workaround is possible if you define a main service for each order and define those measures only on that row - in the other rows the value is NULL. This could lead to unexpected results if queried naively, e.g. for the number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions into the services table if that helps optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea; performance is terrible. You are always better off bringing the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name, plus a corresponding role-playing dimension view for this context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
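As a minimal sketch of such a role-playing dimension view - every object and column name below is an assumption for illustration, not a known schema:

-- Expose a generic date dimension under its "order departure" role.
CREATE VIEW dim_order_departure_date AS
SELECT date_key       AS OrderDepartureDateKey,
       full_date      AS OrderDepartureDate,
       calendar_year  AS OrderDepartureYear,
       calendar_month AS OrderDepartureMonth
FROM dim_date;

The Services fact then joins to this view through its own departure date key, so the role stays explicit in every column name.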
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.
I have a data table of over 50,000 rows. The data contains SALES information for permutations of STORES, DATES, and PRODUCTS. However, the PRODUCTS are actually a combination of PRIMARY PRODUCTS (PP) and SECONDARY PRODUCTS (SP), where a sale QUANTITY of 1 PP should convert to the sale of 1 or more SPs. I have another sheet with the CONVERSIONS of PP to SP, containing the respective MULTIPLIERS (over 500 rows). PPs and SPs have a many-to-many relationship; a PP may convert to several different SPs, and the same SP may be converted from several different PPs.
At the moment, only unconverted sales quantities exist for the PRODUCTS, and it's my job to convert those figures to each PP's respective SP if a MULTIPLIER exists.
Sample: https://i.stack.imgur.com/YdGHn.png
I am able to do that with the following SUMPRODUCT() formula, which appears more efficient than an array formula:
=SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
)
However, given the size of my data set, it still takes forever to process. Is there a more efficient way to do this?
EDIT:
I tried wrapping the formula in a conditional so that SUMPRODUCT is only evaluated if the Product in question can be found in the Conversions table as an SP (it also now displays the values of PRODUCTS that don't have any conversions). This seems to have sped things up a little, but it's still nowhere near quick enough...
=IFERROR(IF(MATCH([#Product],Conversions[SP],0)>0,
SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
),0),0)+[#Quantity]
If you have the possibility to import your data into a database, you can then work with indexed tables; that should be faster.
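For instance, here is a minimal sketch of the equivalent set-based query, assuming the sheets were imported as tables Sales(Store, SaleDate, Product, Quantity) and Conversions(PP, SP, Multiplier) - the names are assumptions, and the join columns (Sales.Product, Conversions.PP) should be indexed:

-- Converted quantity per store, date, and secondary product.
SELECT s.Store,
       s.SaleDate,
       c.SP AS Product,
       SUM(s.Quantity * c.Multiplier) AS ConvertedQuantity
FROM Sales AS s
JOIN Conversions AS c
    ON c.PP = s.Product
GROUP BY s.Store, s.SaleDate, c.SP;

One indexed join like this replaces a SUMPRODUCT/SUMIFS scan of the Conversions table for every one of the 50,000+ rows.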
I am creating a data model for customer invoices in a large data warehouse.
The following shows the fields on a typical invoice:
The following is the data model I worked out so far to model the invoices:
Conventional wisdom is that a large data warehouse should use a star schema, which means one fact table, but it seems that to model an invoice I would need two fact tables, as shown above. Would it be correct to use two fact tables?
I recommend you avoid fact tables with multiple grains where possible.
Since Invoice Fact contains Total Shipping and Total Tax, to boil this down to Invoice Detail Fact, there are two basic options that I can think of:
Create Tax and Freight columns in your Invoice Detail fact and distribute the amounts amongst your items. This Kimball Tip suggests exactly that: http://www.kimballgroup.com/2001/07/01/design-tip-25-designing-dimensional-models-for-parent-child-applications/.
An alternative approach which has worked well for me is to create two new members in your product dimension: one for tax and one for freight. Then add these two line items to the fact just like a normal product, with the appropriate values.
When you analyse by Invoice ID you get the total including Tax & Freight. When you analyse by individual product you don't get a misleading Freight or Tax figure.
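A minimal sketch of that second option, with every table and column name assumed for illustration: after Tax and Freight are added to the product dimension as ordinary members, each invoice-level total becomes one detail row.

-- Turn the invoice-level totals into ordinary line items.
INSERT INTO invoice_detail_fact (invoice_id, product_key, quantity, amount)
SELECT i.invoice_id, p.product_key, 1, i.total_tax
FROM invoice_fact AS i
JOIN dim_product AS p ON p.product_name = 'Tax'
UNION ALL
SELECT i.invoice_id, p.product_key, 1, i.total_shipping
FROM invoice_fact AS i
JOIN dim_product AS p ON p.product_name = 'Freight';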
I have three tables - Sales Manager, Customer, and Order. Each sales manager has multiple customers, and each customer can have multiple orders.
I am interested in determining whether certain attributes of the sales manager and attributes of the customer will lead to sales of a particular product (let's say Product A: yes/no).
Suppose I have 3 sales managers, 10 customers, and 20 orders.
Should I structure the data set to have 3 rows, 10 rows, or 20 rows? Please advise.
Also, will the decision tree and classification algorithm automatically understand the hierarchical relationships among manager, customer, and order?
Thanks.
I think you should make one big feature matrix out of it. Suppose you have tables
Sales Manager (id attr_1 ... attr_m)
Customer (id attr_1 ... attr_n sales_manager_id)
Order (id product_id_1 ... product_id_l customer_id)
Then it is most probably reasonable to create the matrix in the following form
Matrix:
product_id order_attr_1 ... order_attr_l customer_attr_1 ... customer_attr_n ... manager_attr_1 ... manager_attr_m
Now you have a 20*l-row matrix with all the attributes that are given for a certain order.
In the simplest form you can use this matrix directly for classification. If there are too many attributes, it may be reasonable to use PCA first. Maybe you should try Weka and see what comes out.
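As a minimal sketch, assuming the tables carry the foreign keys named above and the data sits in SQL (all table and column names are illustrative), the flattening is a pair of joins:

-- One row per order, carrying the parent customer's and manager's attributes.
SELECT o.product_id,
       o.attr_1 AS order_attr_1,
       c.attr_1 AS customer_attr_1,
       m.attr_1 AS manager_attr_1
FROM orders AS o
JOIN customer AS c
    ON c.id = o.customer_id
JOIN sales_manager AS m
    ON m.id = c.sales_manager_id;

Flattening the hierarchy into plain columns like this is also why the classifier never needs to understand the manager-customer-order relations explicitly.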
Regarding your question about the hierarchical relations: the classification algorithms will not understand them explicitly.
I would recommend the book Introduction to Data Mining, as it answers most of your questions.