Apologies in advance if this is hard to follow, but my question is more conceptual than technical, for those of you out there who have some experience designing this kind of thing.
I'm trying to decide the best way to structure my feeder tables and queries to connect metrics with their measures and objectives.
My metrics database has two primary feeder tables:
tblDissem (with Organization/Project/SubProject columns being relevant)
tblVolume (with Organization/Project/SubProject/Analyzed/Shared columns being relevant)
There are approximately 50 measures organized into 10 objectives in tblMeasures, with OrganizationID, ProjectID, and SubProjectID foreign keys for each measure. The 10 objectives in themselves aren't distinct enough that I can simply create a query for each (it'd have to be a union query after the fact).
The metrics for each measure are based on one of the following:
Count of tblDissem, organized by Organization/Project/SubProject/Fiscal Quarter
Sum of tblVolume.Analyzed, organized by Organization/Project/SubProject/Fiscal Quarter
Sum of tblVolume.Shared, organized by Organization/Project/SubProject/Fiscal Quarter
As it is now, I have select queries set up for each of the two feeder tables converting dates to fiscal quarters, and crosstab queries for each of the above broken out by quarter for each organization/project/subproject.
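Roughly, the date-to-fiscal-quarter conversion in those select queries looks something like this (assuming an October-to-September fiscal year, which may not match yours; DissemDate is a placeholder column name):

SELECT Organization, Project, SubProject,
       Year(DateAdd("q", 1, DissemDate)) & " Q" & DatePart("q", DateAdd("q", 1, DissemDate)) AS FiscalQuarter
FROM tblDissem;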
The challenge is in getting these organized by objective. I figured I could either create a query for each group of measures and then use union queries to organize them into their proper objectives, or add a field to tblMeasures that classifies each measure as a DissemCount, AnalyzedSum, or SharedSum measure, which I'd somehow build into another query to group the results automatically. Or maybe use a lookup field in a way I haven't considered yet.
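To make that second idea concrete, here is a rough sketch of the kind of query I have in mind. The saved query names (qryDissemByQtr, qryVolumeByQtr, qryAllMetrics) and the MeasureType values are placeholders I've made up.

First, a union query (saved as qryAllMetrics) that tags each metric with its type:

SELECT OrganizationID, ProjectID, SubProjectID, FiscalQuarter,
       "DissemCount" AS MetricType, Count(*) AS MetricValue
FROM qryDissemByQtr
GROUP BY OrganizationID, ProjectID, SubProjectID, FiscalQuarter
UNION ALL
SELECT OrganizationID, ProjectID, SubProjectID, FiscalQuarter,
       "AnalyzedSum", Sum(Analyzed)
FROM qryVolumeByQtr
GROUP BY OrganizationID, ProjectID, SubProjectID, FiscalQuarter
UNION ALL
SELECT OrganizationID, ProjectID, SubProjectID, FiscalQuarter,
       "SharedSum", Sum(Shared)
FROM qryVolumeByQtr
GROUP BY OrganizationID, ProjectID, SubProjectID, FiscalQuarter;

Then a second query joins that to tblMeasures on the keys plus the new MeasureType field, so the results fall under the right measure (and therefore objective) automatically:

SELECT m.ObjectiveID, m.MeasureID, x.FiscalQuarter, x.MetricValue
FROM tblMeasures AS m
INNER JOIN qryAllMetrics AS x
    ON m.OrganizationID = x.OrganizationID
   AND m.ProjectID = x.ProjectID
   AND m.SubProjectID = x.SubProjectID
   AND m.MeasureType = x.MetricType;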
I'm open to any ideas, and apologies for being so abstract. Thanks in advance...I'm just not an expert when it comes to the how of relating information.
I am working on a Power BI report that consists of multiple dashboards. The data needed comes from a single table with 100K rows in the DWH; the table stores all the variables and values for different stores.
Currently, we create a new table in the data mart for each separate dashboard, such as total profit per country, total number of staff per country, etc. However, I realize I can do the same using Power Query without adding new tables to my data mart. So I am curious which approach is better?
And this leads to another question I always have: when we need a transformed table for a dashboard, should we create new tables in the data mart, or should we do the transformation in the BI tool such as Power BI or Tableau? I think performance is one factor to consider, but I'm not sure about the other factors.
I'd appreciate it if anyone can share their opinion.
Given the amount of transformation that needs to occur, it would be worth doing this in the DWH. Power BI does well with a star schema, so it would be good to break out dimensions like country, store and date into their own tables.
You might also work the measures into a single fact table, or maybe two if some of the facts are transactional and others are semi-additive snapshot facts (e.g., profit vs. number of staff). Designed right, the model could support all of the dashboards, so you would not need a report table for each.
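To make that shape concrete, a minimal sketch of such a model could look like the following (table and column names are assumptions on my part, not taken from your model):

-- dimensions broken out of the single wide table
CREATE TABLE DimCountry (
    CountryKey  INT PRIMARY KEY,
    CountryName VARCHAR(100)
);

CREATE TABLE DimStore (
    StoreKey   INT PRIMARY KEY,
    StoreName  VARCHAR(100),
    CountryKey INT REFERENCES DimCountry (CountryKey)
);

CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY,  -- e.g. 20240131
    CalendarDate DATE
);

-- transactional, fully additive facts
CREATE TABLE FactProfit (
    DateKey  INT REFERENCES DimDate (DateKey),
    StoreKey INT REFERENCES DimStore (StoreKey),
    Profit   DECIMAL(18, 2)
);

-- periodic snapshot, semi-additive facts (don't sum headcount across dates)
CREATE TABLE FactStaffSnapshot (
    DateKey    INT REFERENCES DimDate (DateKey),
    StoreKey   INT REFERENCES DimStore (StoreKey),
    StaffCount INT
);

Power BI then relates every dashboard to the same dimension tables, rather than needing a separate report table per dashboard.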
I just started in Data Warehouse modeling and I need help for the modeling of a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files, linked together, one file 'order' and the other 'services'.
the 'order' file sets out a summary of each flight (orderId, departure date, arrival date, City of departure, City of arrival, total amount collected, etc.)
the 'services' file lists the services provided by flight (orderId, service name, quantity, amount / qty, etc.)
with a 1-n relationship (order-services): each order has n services
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I've thought about it: the star and snowflake schemas don't seem to work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. Where I'm stuck is whether I should treat the order table as a dimension rather than a fact table, or instead treat the services table as a dimension, but both look like fact tables to me. I'm getting a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have multiple connected fact tables; see the discussion here.
So the first draft will simply follow your two fact tables with their natively provided dimensions.
Order is a fact table in one context and, in another context, a dimension table for the services table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the services table, so that the services fact carries the departure date, arrival date, etc. dimensions directly.
This would be done at load time in the ETL job.
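For illustration only, that load could look something like this (staging table and column names are assumptions):

-- copy the order-level dimension keys onto each service row at load time
INSERT INTO fact_service
    (order_id, service_name, quantity, amount,
     departure_date_key, arrival_date_key, departure_city_key, arrival_city_key)
SELECT s.order_id, s.service_name, s.quantity, s.amount,
       o.departure_date_key, o.arrival_date_key, o.departure_city_key, o.arrival_city_key
FROM stg_services AS s
JOIN stg_orders AS o ON o.order_id = s.order_id;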
I would be somewhat careful about denormalizing the measures from order to service, which would basically eliminate the whole order table.
There is no problem with the measure 'total amount collected' if it is just a redundant sum of the service amounts; you may safely get rid of it.
But you will surely need measures such as the number of flights or the number of people transported; those measures are better kept in the order fact table, since you cannot simply replicate them in the N service rows of each order.
A workaround is possible if you define a main service for each order and store those measures only in that row, with NULL in the other rows. This could lead to unexpected results if queried naively, e.g. for the number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions into the services table if that helps optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea; performance is terrible. You are always better off bringing the dimensions from, in your case, Order over to Services. Don't forget to include the context of the dimension in the column name, along with a corresponding role-playing dimension view for that context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
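As a sketch of what that looks like in practice (all names illustrative), the shared date dimension gets one role-playing view per context, and the Services fact carries one key column per role:

-- one physical date dimension, role-played per context
CREATE VIEW DimOrderDepartureDate AS
SELECT DateKey      AS OrderDepartureDateKey,
       CalendarDate AS OrderDepartureDate
FROM DimDate;

CREATE VIEW DimOrderArrivalDate AS
SELECT DateKey      AS OrderArrivalDateKey,
       CalendarDate AS OrderArrivalDate
FROM DimDate;

-- the Services fact then has OrderDepartureDateKey and OrderArrivalDateKey
-- columns, each joined to its own view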
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.
I have an application with several tables, like users, stories, and comments, which contain fields like id, rating, text, is_deleted, and so on.
There are >145 million comments, >7 million stories, and >2.5 million users.
For each column in each table I have another table for storing versions; for example, the comment rating history table is defined like this:
item_id uint64
timestamp int64
value int32
There are similar tables for the history of columns of other types, like bool or string.
It currently runs on Postgres.
What I want to achieve: efficiently query the data, make distributions by day/hour and collect other statistics on my data.
The problem is that Postgres is really slow; for example, it takes >8 hours to build a distribution of comments by day, and queries like select count(*) where timestamp > x and timestamp < y are also slow, because Postgres fetches all the matching rows and has no index it can use just for counting.
The question: which database is better suited for this kind of time-series data? I've heard of InfluxDB, ClickHouse, and others, but I don't have experience with any of them, so it's hard for me to choose.
What you describe sounds like a data warehouse. Such a data warehouse needs careful modeling in any database system to work efficiently.
Typically, you have to pre-aggregate the data, for example per day, by using materialized views or triggers.
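For example, in Postgres the daily pre-aggregation could be a materialized view along these lines (assuming the rating-history table is named comment_rating_history and timestamp is a Unix epoch; both are assumptions):

-- pre-aggregate the per-version rows down to one row per day
CREATE MATERIALIZED VIEW comment_rating_daily AS
SELECT date_trunc('day', to_timestamp("timestamp")) AS day,
       count(*)   AS changes,
       avg(value) AS avg_rating
FROM comment_rating_history
GROUP BY 1;

CREATE INDEX ON comment_rating_daily (day);

-- refresh on a schedule, e.g. nightly
REFRESH MATERIALIZED VIEW comment_rating_daily;

-- a distribution by day then scans the small view instead of 145M+ rows
SELECT day, changes FROM comment_rating_daily WHERE day >= '2023-01-01';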
Here's a simple version of the website I'm designing: users can belong to one or more groups, as many as they want. When they log in, they are presented with the groups they belong to. Ideally, in my Users table I'd like an array or something unbounded to which I can keep adding the IDs of the groups that user joins.
Additionally, although I realize this isn't necessary, I might want a column in my Group table which holds an indefinite number of user IDs belonging to that group. (Side question: would that be more efficient than getting all the users of the group by querying the user table for users belonging to a certain group ID?)
Does my question make sense? Mainly I want to be able to fill a column up with an indefinite list of IDs... The only way I can think of is making it like some super long varchar and having the list JSON encoded in there or something, but ewww
Please and thanks
Oh, and it's a MySQL database (my website is in PHP), but after 2 years of PHP development I've recently decided PHP sucks and I hate it, and ASP.NET web applications are the only way for me, so I guess I'll be implementing this on whatever kind of database I'll need for that.
Your intuition is correct; you don't want to have one column of unbounded length just to hold the user's groups. Instead, create a table such as user_group_membership with the columns:
user_id
group_id
A single user_id could have multiple rows, each with the same user_id but a different group_id. You would represent membership in multiple groups by adding multiple rows to this table.
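A minimal MySQL sketch of that table (names are examples only):

-- junction (bridge) table for the many-to-many relationship
CREATE TABLE user_group_membership (
    user_id  INT UNSIGNED NOT NULL,
    group_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, group_id),   -- covers "groups for a user" lookups
    KEY idx_group (group_id),          -- covers "users in a group" lookups
    FOREIGN KEY (user_id)  REFERENCES users (id),
    FOREIGN KEY (group_id) REFERENCES `groups` (id)
);

-- the groups shown to a user at login
SELECT g.*
FROM `groups` AS g
JOIN user_group_membership AS m ON m.group_id = g.id
WHERE m.user_id = 123;

With the group_id index above, looking up all users in a group is a straightforward indexed query against this one table, so there is no need to also store a user-ID list on the Group row.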
What you have here is a many-to-many relationship. A "many-to-many" relationship is represented by a third, joining table that contains both primary keys of the related entities. You might also hear this called a bridge table, a junction table, or an associative entity.
You have the following relationships:
A User belongs to many Groups
A Group can have many Users
In database design, this is represented with a third table, UserGroup, holding a UserID and a GroupID. This way, a UserGroup row represents any combination of a User and a Group without the problem of having "infinite columns."
If you store an indefinite amount of data in one field, your design does not conform to First Normal Form. FNF is the first step in a design discipline called data normalization, which is a major aspect of database design. Normalized design is usually good design, although there are some situations where a different design pattern is better suited.
If your data is not in FNF, you will end up doing sequential scans for some queries where a normalized database would be accessed via a quick lookup. For a table with a billion rows, this could mean a delay of an hour rather than a few seconds. FNF guarantees a direct-access lookup path for each item of data.
As other responders have indicated, such a design will involve more than one table, to be joined at retrieval time. Joining takes some time, but it's tiny compared to the time wasted in sequential scans, if the data volume is large.
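To make that concrete with the groups example (a MySQL sketch; group_ids is a hypothetical packed column, shown only to illustrate the problem):

-- non-FNF: IDs packed into one column force a scan of every row
SELECT * FROM users WHERE FIND_IN_SET('42', group_ids);

-- FNF: membership rows in their own table; an index on group_id answers this directly
SELECT u.*
FROM users AS u
JOIN user_group_membership AS m ON m.user_id = u.id
WHERE m.group_id = 42;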
I have about 10 tables containing records with date ranges and some value belonging to each date range.
Each table has some meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then a dates table
day DATE
which contains a row for each day, 2 years ahead.
The final result comes from joining these 10 tables to the dates table.
The query takes rather long, because there are also some other joins and subqueries.
I have been thinking about creating one bigger table containing all 10 tables' data for each day, but the final table would have about 1.5M-2M records.
From testing, it seems quicker (0.2 s instead of about 1 s) to search this table rather than joining the tables and searching the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
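It would be filled by expanding the ranges against the dates table, roughly like this (day_data is just a placeholder name; in reality there would be 10 joins, not 2):

INSERT INTO day_data (day, price, availability)
SELECT d.day, r.price, a.availability
FROM dates AS d
LEFT JOIN rates AS r ON d.day BETWEEN r.start_date AND r.end_date
LEFT JOIN availability AS a ON d.day BETWEEN a.start_date AND a.end_date;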
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that dates from one table don't line up with dates from another table, which creates extra boundaries for some attributes, because once everything is in one table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with; the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date: my "super" table had to be recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing, and margin. The fact was that the 3 were unrelated, so date ranges never aligned, leading to lots of artificial rows in the big table.