Got a question regarding a large SSAS cube:
Scenario: building a product revenue cube on SQL Server Analysis Services (SSAS) 2012:
The product hierarchy is a dimension with 7 levels: cat1, cat2, …, cat7 (product_id).
The fact table is at the product id level, key = (customer id + bill month + product id); this gives around 200M rows per month and may still go up.
Concern: The prototype currently holds only 3 months of data, and cube browsing is already slow. Dragging and dropping 2-3 dimensions can take a couple of minutes or more.
Later on, 30+ other dimensions need to be added, and the data will extend to 1-2 years, i.e. 2-3B rows. So we are concerned about performance: can that much data be handled with acceptable browsing speed?
Question: Is there any better way (design, performance tuning) for the above scenario?
E.g. another idea is to flatten the fact table, i.e. make it customer level, key = (customer id + bill month), so each customer has only one row per bill month, while adding 40+ columns, one per product. That way the fact row count per month would go down to, say, 5M. But then we can't build/browse the product dimension anymore (which is an important requirement), can we?
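To make that alternative concrete, here is a minimal T-SQL sketch of the flattened layout (the table and column names are made up for illustration). Because the product is encoded in column names rather than in rows, there is no product_id key left for a product dimension to join to, which is exactly the concern above.

-- Flattened fact: one row per customer per bill month, one revenue column per product
CREATE TABLE FactRevenueFlat (
    CustomerId  int           NOT NULL,
    BillMonth   date          NOT NULL,
    Product1Rev decimal(18,2) NULL,
    Product2Rev decimal(18,2) NULL,
    -- ... one column for each of the 40+ products ...
    CONSTRAINT PK_FactRevenueFlat PRIMARY KEY (CustomerId, BillMonth)
);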
Thanks in advance for any light shedding.
I have been asked what the granularity of data means, but I only know that it is the finest level of data. For example, if you have a name in a fact table, then its details such as email, phone number, etc. can be found in a dimension table. I have a sample dataset and an area-level analysis which I have worked on. Please explain the granularity of data based on this data.
Dataset:
itemid    item             RID    Rname          Area        Time_Availability>70%
6222589   peanut banana    1000   Cafe adda      gachibowli  True
6355784   chocolate fudge  2000   Santosh hotel  Attapur     False
Area-level analysis of restaurant onboarding to a platform:
Area        Total Ingested restaurants   Available   items_Available >=5
Gachibowli  5                            4           2
Attapur     5                            4           2
Thank you
The granularity of a fact table is the minimum set of attributes that will uniquely identify a measure.
For example (and I'm not saying this is a real-world example), if you had a sales fact table and there could only be one sale per customer per day, then "per customer per day" would be the granularity of that fact table. You might have other dimensions, such as the store the sale occurred in or the country where the transaction took place, but these would not affect the granularity if you could still only have one sale per customer per day, regardless of which store or country that transaction took place in.
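As a rough T-SQL sketch (table and column names are made up), that grain can be expressed as the fact table's key; the extra dimensions are just additional foreign key columns and do not widen the key:

-- Hypothetical sales fact at the "one sale per customer per day" grain
CREATE TABLE FactSales (
    CustomerKey int           NOT NULL,
    DateKey     int           NOT NULL,
    StoreKey    int           NOT NULL,  -- extra dimension; does not change the grain
    CountryKey  int           NOT NULL,  -- extra dimension; does not change the grain
    SalesAmount decimal(18,2) NOT NULL,
    CONSTRAINT PK_FactSales PRIMARY KEY (CustomerKey, DateKey)  -- grain: customer + day
);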
I am new to Power BI and database management, and I want to clarify for myself how Power BI works in reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought is a good thing to do in data modelling.
I want to aggregate the value column of end_table over the grouping column Year. I am surprised that, to my understanding, Power BI sums up the value column within end_table, when I would expect the aggregation over the grouping variable in the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
# main_table: one row per firm-year; FK_id is the foreign key into end_table
main_table <- data.frame(id = 1:20, FK_id = sample(1:2, 20, replace = TRUE), Jahre = 2016:2020)
main_table <- rbind(main_table, data.frame(id = 21:25, FK_id = sample(2:3, 5, replace = TRUE), Jahre = 2015))

# end_table: the values to be aggregated
end_table <- data.frame(id = 1:3, value = c(10, 20, 30))
The first 5 rows of the data, including all columns, look like this:
If I take out all row-specific information and sum over value, it will always show the total of end_table, which is 60, in each Year.
Making the relationship bi-directional does not help; it just sums up the end_table values that exist in each year. I do get the correct results if I add the value column to main_table as a calculated column: value = RELATED(end_table[value])
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to main_table using RELATED() when it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows in a visual and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need to write a measure that sums over the rows of main_table. Since main_table is essentially acting as the dimension table for Year here, you can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )
I do descriptive analytics and reporting at a company that sells a wide range of products. We record sales transactions, and every time an item is sold, the following is recorded:
Customer ID (each customer has a unique ID)
Product ID (each product has a unique ID)
Sale date
(Other fields are recorded too - location of purchase, quantity, payment type, etc.)
We sell a few big ticket items, and what I'm wondering is if it's possible to predict whether a customer will buy one of the big ticket items based on their purchase history, using transactional data as described above. We have about 2 million rows of sales data spanning seven years, and in that time maybe 14,000 big ticket items have been sold to 5,000 out of 50,000 customers.
I use SQL Server 2008 R2 which has the data mining feature. I did some brief reading on it but can't figure out what model would be best, or if it's something that's even doable. Can someone point me in the right direction to get started?
Not sure if the SQL Server data mining feature is useful. I took a look at the one in SQL Server 2012 and decided it wasn't.
As for your prediction, this would be a supervised learning problem (just pick any simple algorithm), where each customer is a row and your features would be the different products. Your positive labels would then be the customers that had bought big-ticket items.
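One way to sketch that data preparation in T-SQL (table and column names such as Sales, Products and IsBigTicket are assumptions, not from the question): one row per customer, a few aggregate features, and a 0/1 label for whether the customer ever bought a big-ticket item.

SELECT
    s.CustomerID,
    COUNT(*)                                           AS TotalPurchases,
    COUNT(DISTINCT s.ProductID)                        AS DistinctProducts,
    MAX(CASE WHEN p.IsBigTicket = 1 THEN 1 ELSE 0 END) AS BoughtBigTicket  -- positive label
FROM Sales AS s
JOIN Products AS p ON p.ProductID = s.ProductID
GROUP BY s.CustomerID;

To get one feature column per product, you would pivot ProductID into columns (PIVOT or conditional aggregation) before feeding the result to whatever algorithm you pick.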
What you are looking for is called sequential pattern mining, and the specific technique you are looking for is called discrete event prediction. With that being said, however, I don't think you will be able to do what you want with an out-of-the-box solution on SQL Server.
Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations).
In the normalized model, I'd have:
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, an order id, and the associated metrics.
If I modeled this in a star schema format, I'd design like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
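A rough T-SQL DDL sketch of that star layout (the column names here are invented for illustration):

CREATE TABLE DimProducts (
    ProductKey  int          NOT NULL PRIMARY KEY,
    ProductName varchar(100) NOT NULL
    -- ... other product metadata ...
);

CREATE TABLE FactOrders (
    OrderId       int  NOT NULL,
    ProductKey    int  NOT NULL REFERENCES DimProducts (ProductKey),
    MetricDate    date NOT NULL,  -- one of the ~30 days after the order
    Activations   int  NOT NULL,
    Deactivations int  NOT NULL,
    CONSTRAINT PK_FactOrders PRIMARY KEY (OrderId, MetricDate)
);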
Does my performance gain from a huge FactOrders table, only needing one join to get all relevant information, outweigh the fact that I increased my table size by 30x and have an incredible amount of repeated data, versus the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster due to the much wider rows. In tables this tiny, it is very unlikely that denormalizing would help.
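A minimal sketch of "index properly instead" on the normalized tables from the question, written in T-SQL (the index, table and column names are assumed):

CREATE INDEX IX_Metrics_OrderId_Date
    ON Metrics (OrderId, MetricDate) INCLUDE (Activations, Deactivations);
CREATE INDEX IX_Orders_ProductId
    ON Orders (ProductId);

-- The daily report stays a simple join over small, narrow tables:
SELECT m.MetricDate, p.ProductName, SUM(m.Activations) AS Activations
FROM Metrics AS m
JOIN Orders   AS o ON o.OrderId   = m.OrderId
JOIN Products AS p ON p.ProductId = o.ProductId
GROUP BY m.MetricDate, p.ProductName;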
I'm writing an application that stores different types of records by user and day. These records are divided in categories.
When designing the database, we create a table User and then for each record type we create a table RecordType and a table Record.
Example:
To store data related to user events we have the following tables:
Event           EventType
-----           ---------
UserId          Id
EventTypeId     Name
Value
Day
Our boss pointed out (with some reason) that we're gonna store a lot of rows ( Users * Days ) and suggested an idea that seems a little crazy to me: Create a table with a column for each day of the year, like so:
EventTypeId | UserId | Year | 1 | 2 | 3 | 4 | ... | 365 | 366
This way we only have 1 row per user per year, but we're gonna get pretty big rows.
Since most ORMs (we're going with rails3 for this project) use select * to get the database records, aren't we optimizing something to "deoptimize" another?
What are the community's thoughts on this?
This is a violation of First Normal Form. It's an example of repeating groups across columns.
Example of why this is bad: Write a query to find which day a given event occurred. You'll need a WHERE clause with 366 terms, separated by OR. This is tedious to write, and impossible to index.
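A sketch of the two query shapes in T-SQL (assuming the wide table keeps the day-number columns from the layout above; the table names are made up):

DECLARE @value int = 5;

-- One column per day: the predicate has to mention every day column.
SELECT EventTypeId, UserId, [Year]
FROM EventsByYear
WHERE [1] = @value OR [2] = @value OR [3] = @value
   -- ... and so on for all 366 day columns ...
   OR [366] = @value;

-- One row per event with a Day column: a single, indexable predicate.
SELECT [Day]
FROM Event
WHERE Value = @value;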
Relational databases are designed to work well even if you have a lot of rows. Say you have 10000 users, and on average every user generates 10 events every day. After 10 years, you will have 10000*366*10*10 rows, or 366,000,000 rows. That's a fairly large database, but not uncommon.
If you design your indexes carefully to match the queries you run against this data, you should be able to have good performance for a long time. You should also have a strategy for partitioning or archiving old data.
That breaks database normalization principles:
http://databases.about.com/od/specificproducts/a/normalization.htm
If it's applicable, why don't you replace the day columns with a DateTime column in your Event table with a default value (GETDATE(), since we are talking about SQL Server)?
Then you could group by date.
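A sketch of that approach (the Value column's type is assumed):

-- One row per event, with the day stored as a single datetime column defaulting to "now".
CREATE TABLE Event (
    UserId      int      NOT NULL,
    EventTypeId int      NOT NULL,
    Value       int      NULL,
    EventDate   datetime NOT NULL DEFAULT (GETDATE())
);

-- Reporting then groups by the date part instead of scanning 366 columns.
SELECT UserId, EventTypeId, CAST(EventDate AS date) AS EventDay, COUNT(*) AS Events
FROM Event
GROUP BY UserId, EventTypeId, CAST(EventDate AS date);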
I wouldn't do it. As long as you take the time to index the table appropriately, the database server should work well with tables that have lots of rows. If it's significantly slowing down your database performance, I'd start by making sure your queries aren't forcing a lot of full table scans.
Some other potential problems I see:
It probably will hurt ORM performance.
It's going to create maintainability problems down the road. You probably don't want to be working with objects that have 366 fields, one for each day of the year, so there's probably going to be a lot of boilerplate glue code to keep track of them.
Any query that wants to search against a range of dates is going to be an unholy mess.
It could be even more wasteful of space. These rows are big, and the number of rows you have to create for each customer is going to be the sum of the maximum number of times each different kind of event happened in a single day. Unless the rate at which all of these events happens is very constant and regular, those rows are likely to be mostly empty.
If anything, I'd suggest sharding the table based on some other column instead if you really do need to get the table size down. Perhaps by UserId or year?
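If "sharding by year" is read as SQL Server table partitioning, a minimal sketch could look like the following (boundary years and object names are invented, and availability depends on the SQL Server edition):

CREATE PARTITION FUNCTION pf_EventYear (int)
    AS RANGE RIGHT FOR VALUES (2011, 2012, 2013);

CREATE PARTITION SCHEME ps_EventYear
    AS PARTITION pf_EventYear ALL TO ([PRIMARY]);

CREATE TABLE Event (
    UserId      int      NOT NULL,
    EventTypeId int      NOT NULL,
    Value       int      NULL,
    EventDate   datetime NOT NULL,
    EventYear   AS YEAR(EventDate) PERSISTED NOT NULL  -- partitioning column
) ON ps_EventYear (EventYear);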