I have created a number of Type 1 dimensions to hold customer/subscription-level details. These dimensions are very large compared to my other dimensions, with nearly a 1:1 relationship to the facts. The dimensions are only being used to provide drillthrough details.
It's all working, but these dimensions are quite large and I'm running into memory issues when processing. I'm wondering if there are some properties I should be setting, since these are only used for drillthrough? NonAggregatable?
Would it be better to include the details as non-aggregatable measures, since there is nearly a 1:1 relationship?
An example would be SubscriptionDetail, which has values like email, userUID, and activationcode. If users are looking at the subscription fact, they can drill through to pull these details.
You won't be able to use strings as measures, so Email will be out.
I have had success, though, with hidden datetime measures for drillthrough, to expose the exact datetime when the fact table otherwise keys off a date dimension.
If processing is an issue and there is a 1:1 relationship with the fact, does the dimension change historically? If not, have you tried ProcessAdd to only add the new rows? If you have Enterprise Edition SSIS there is a component for this, or you can generate your own XMLA and send it to the server as part of the processing: http://www.artisconsulting.com/Blogs/tabid/94/EntryID/3/Default.aspx
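For reference, a minimal sketch of the XMLA such a command can take (the database and dimension IDs below are made up, and a real dimension ProcessAdd usually also needs out-of-line bindings that restrict the source query to only the new rows, as the linked post describes):

    <!-- Hedged sketch: ProcessAdd for a dimension; all IDs are hypothetical. -->
    <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>SubscriptionsDB</DatabaseID>
        <DimensionID>Subscription Detail</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
      <!-- In practice this is wrapped in a <Batch> that also carries
           out-of-line <Bindings> pointing the dimension at a query
           returning just the newly arrived rows. -->
    </Process>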
I have a fairly large database in SQL Server. To illustrate my use case, suppose I have a mobile game and I want to report on user activity.
To start with I have a table that looks like:
userId | date       | # Sessions | Total Session Duration
-------|------------|------------|-----------------------
1      | 2021-01-01 | 3          | 55
1      | 2021-01-02 | 9          | 22
2      | 2021-01-01 | 6          | 43
I am trying to "add" information about each session to this data. The options I'm considering are:
Add the session data as a new column containing a JSON array with the data for each session
Create a table with all session data, indexed by userId & date, and query this table as needed.
Is this possible in SQL Server? (my experience is coming from GCP's BigQuery)
Your question boils down to whether it is better to use nested data or to figure out a system of tables where every column of every table has a simple domain (text string, number, date, etc.).
It turns out that this question was being pondered by Ed Codd fifty years ago, when he was proposing the first database system based on the relational model. He decided that it was worthwhile restricting relational databases to Normal Form, later renamed First Normal Form. He proved to his own satisfaction that this restriction wouldn't reduce the expressive power of the relational model, and it would make it easier to build the first relational database manager.
Since then, just about every relational or SQL database has conformed to First Normal Form, although there are ways to get around the restriction by storing one of various forms of data structures in one column of a table. JSON is an example.
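For what it's worth, SQL Server 2016 and later can unpack such a column again with OPENJSON. A minimal sketch of the JSON route (table and column names are invented for illustration):

    -- Hypothetical table: one row per user per day, sessions embedded as JSON.
    CREATE TABLE dbo.UserActivity (
        userId   int           NOT NULL,
        [date]   date          NOT NULL,
        sessions nvarchar(max) NOT NULL,  -- e.g. '[{"start":"09:02","duration":25}]'
        CONSTRAINT PK_UserActivity PRIMARY KEY (userId, [date]),
        CONSTRAINT CK_UserActivity_json CHECK (ISJSON(sessions) = 1)
    );

    -- Unpacking the array back into rows:
    SELECT a.userId, a.[date], s.[start], s.duration
    FROM dbo.UserActivity AS a
    CROSS APPLY OPENJSON(a.sessions)
         WITH ([start] time '$.start', duration int '$.duration') AS s;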
You'll gain the flexibility you get with JSON, but you will lose the ability to specify the data you want to retrieve using the various clauses of the SELECT statement, clauses like INNER JOIN or WHERE, among others. This loss could be a deal killer.
If it were me, I would go with the added-table approach, and analyze the session data down into one or more tables with simple columns. But you may find that JSON decoders are just as powerful, and that full table scans are worth the time they take.
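A minimal sketch of that added-table approach, with names again invented: one row per session, keyed back to userId and date, so every column keeps a simple domain and stays fully queryable:

    -- Hypothetical child table: one row per individual session.
    CREATE TABLE dbo.UserSession (
        sessionId int IDENTITY PRIMARY KEY,
        userId    int  NOT NULL,
        [date]    date NOT NULL,
        startTime time NOT NULL,
        duration  int  NOT NULL  -- session length, e.g. in minutes
    );

    -- The original daily summary then falls out of a plain GROUP BY:
    SELECT userId, [date],
           COUNT(*)      AS [# Sessions],
           SUM(duration) AS [Total Session Duration]
    FROM dbo.UserSession
    GROUP BY userId, [date];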
In the application I am working on, we have data grids that have the capability to display custom views of the data. As a point of reference, we modeled this feature using the concept of views as they exist in SharePoint.
The custom views should have the following capabilities:
Be able to define which subset of columns (of those that are available) should be displayed in the view.
Be able to define one or more filters for retrieving data. These filters are not constrained to use only the columns that are in the result set, but must use one of the available columns. Standard logical conditions and operators apply to these filters. For example, ColumnA Equals Value1 or ColumnB >= Value2.
Be able to define a set of columns that the data will be sorted by. This set of columns can be one or more columns from the set of columns that will be returned in the result set.
Be able to define a set of columns that the data will be grouped by. This set of columns can be one or more columns from the set of columns that will be returned in the result set.
I have application code that will dynamically generate the necessary SQL to retrieve the appropriate set of data. However, it appears to perform poorly. When I run across a poorly performing query, my first thought is to determine where indexes might help. The problem here is that I won't necessarily know which indexes need to be created as the underlying query could retrieve data in many different ways.
Essentially, the SQL that is currently being used does the following (a sketch follows the list):
Creates a temporary table variable to hold the filtered data. This table contains a column for each column that should be returned in the result set.
Inserts data that matches the filter into the table variable.
Queries the table variable to determine the total number of rows of data.
If requested, determines the grouping values of the data in the table variable using the specified grouping columns.
Returns the requested page of the requested page size of data from the table variable, sorted by any specified sort columns.
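A stripped-down sketch of that flow, with all table and column names invented (the OFFSET/FETCH shown here is available on SQL Server 2012+ and SQL Azure, so it fits the version constraints below):

    -- Hypothetical shape of the generated SQL described above.
    DECLARE @rows TABLE (ColA int, ColB nvarchar(50), ColC date);

    -- Steps 1-2: materialize the filtered result columns.
    INSERT INTO @rows (ColA, ColB, ColC)
    SELECT ColA, ColB, ColC
    FROM dbo.SourceTable
    WHERE ColA = 1 AND ColB >= N'Value2';  -- user-defined filters

    -- Step 3: total row count for the pager.
    SELECT COUNT(*) AS TotalRows FROM @rows;

    -- Step 4: optional grouping values.
    SELECT ColC, COUNT(*) AS GroupCount FROM @rows GROUP BY ColC;

    -- Step 5: one page, sorted as requested.
    SELECT ColA, ColB, ColC
    FROM @rows
    ORDER BY ColB
    OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;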
My question is: what are some ways that I might improve this process? For example, one idea I had was to have my table variable contain only the columns used to group and sort, and then join in the source table at the end to get the rest of the displayed columns. I am not sure whether this would make any difference, which is the reason for this post.
I need to support versions 2014, 2016 and 2017 of SQL Server in addition to SQL Azure. Essentially, I will not be able to use a specific feature of an edition of SQL Server unless that feature is available in all of the aforementioned platforms.
I think your general approach is fine - essentially you are making a GUI generator for SQL. However, a few things:
This type of feature is best suited to a warehouse or a read-only replica database. Do not build this on a live production transactional database. There are permutations you haven't thought of that your users will find, and they will kill your database. (This is also true from a warehouse standpoint, but a warehouse usually doesn't carry the response-time expectations of a transactional database.)
The method you described for paging is not efficient from a database standpoint. You are essentially querying, filtering, grouping, and sorting the same exact dataset multiple times just to cherry-pick a few rows each time. If the data is cached that might be OK, but you shouldn't make that assumption. If you have the know-how, figure out how to snapshot the entire final data set with an extra column that keeps the data physically sorted in the order the user requested. That way you can quickly query the results for your paging (see the sketch after this list).
If you have a Repository/DAL layer, design your solution so that in the future certain combinations of tables/columns can use hardcoded queries/stored procedures. Certain queries will inevitably pop up that cause performance problems, and you may have to build a custom solution for those specific queries to get performance that your dynamic SQL can't deliver.
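As a hedged sketch of the snapshot idea from the second point (every name here is hypothetical): compute the user's sort once, persist it as a physical ordering key, and page by seeking on that key:

    -- Build the snapshot once per user request.
    SELECT ROW_NUMBER() OVER (ORDER BY ColB, ColA) AS SortKey,
           ColA, ColB, ColC
    INTO dbo.ResultSnapshot_Session42   -- or a #temp table per session
    FROM dbo.SourceTable
    WHERE ColA = 1;                     -- the user-defined filters

    CREATE UNIQUE CLUSTERED INDEX IX_Snapshot
        ON dbo.ResultSnapshot_Session42 (SortKey);

    -- Each page is now a cheap range seek instead of a repeated sort.
    SELECT ColA, ColB, ColC
    FROM dbo.ResultSnapshot_Session42
    WHERE SortKey BETWEEN 21 AND 30
    ORDER BY SortKey;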
A bit of a tricky question. I might just have to do it through VBA with a proper script; however, if someone actually has a complicated answer (let's be honest, I don't think there's a super simple formula for this), I'm interested. I'd rather do as much as I can through formulas. I've attached a sample.
The data: I have data that relates to countries. In each country you can have multiple sites. For each site, you may or may not have different distributions. When those distributions meet a given criterion, I want to tally that up as a "break" and count how many there are by country, site, etc.
How it works: I'm using array formulas with SUMPRODUCT() for this. The nice thing is that you can easily add criteria: each criterion returns a 0/1 array, so when you multiply them together you get the array you need to sum up to see how many breaks you have.
The problem: I am unable to write the formula so that each site is counted only once when the same site has 2 different distribution types and both meet the break criterion. If both distributions meet the break criterion, I don't want to record that as 2 breaks; otherwise I may end up with more sites with breaks recorded than there are sites. Part of the problem is how I account for the uniqueness of sites:
(tdata[siteid]>"")/COUNTIF(tdata[siteid],tdata[siteid] &"")
This is actually a bit of a hack, in the sense that, as opposed to the other formulas, it doesn't return 0/1 but possibly fractions. These do add up correctly and allow me to, say, count the number of sites correctly, but the array isn't formatted as 0/1, so when multiplied with other 0/1 arrays it messes up the results.
I control the data, so I have some leeway. I work with tables (as can be seen) and VBA is already used. I could sort the source tables if that helps. Source data:
1 row = 1 distribution for 1 site in 1 month
The summary table per country that I linked is based on this source data.
Any ideas?
EDIT - Filtering by distribution is not really an option. I already have event-based filters for the source data, and I can already calculate the indicator correctly for data filtered by distribution. But I also need to display global data (which is currently not working). Also, there are other indicators to calculate that won't work if I filter the data (it's a big dashboard).
EDIT2: In other words, I need some way to account for the fact that if the same criterion (break or not) is met in 2 rows with the same siteid but 2 different distributions, I want to count that as 1 break only, while keeping in mind that if one distribution has a break (and the other does not), I still want to record it as 1 site with a break in that country.
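For what it's worth, one common pattern for "count each siteid at most once among rows meeting the criteria" divides by a conditional COUNTIFS rather than a plain COUNTIF. A sketch, assuming a helper column tdata[break] holding the per-row 0/1 break flag (the column name is invented):

    =SUMPRODUCT((tdata[break]=1) / (COUNTIFS(tdata[siteid], tdata[siteid], tdata[break], 1) + (tdata[break]<>1)))

Each breaking row for a site contributes 1 divided by the number of breaking rows for that site, so every site with at least one break sums to exactly 1; the trailing (tdata[break]<>1) term only exists to stop non-breaking rows from dividing by zero. Extra criteria (e.g. country) can be multiplied into the numerator and added as another range/criteria pair inside the COUNTIFS.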
EDIT3: I've decided to make a new table that summarizes the data for each site individually (each of which may have more than one distribution). Then I can calculate the global figures from that.
My take-home message from this: when you have many levels of data in Excel formulas (e.g. countries, sites, with some kind of sub-level such as distributions), it's difficult NOT to summarize the data in intermediate tables for the level of analysis you want to focus on. E.g. in my case, I am interested in country-level analysis, which is 2 "levels" above the distribution level. This means there will be "duplication" of data from a site-level perspective. You may be able to navigate around this, but I think by far the simpler solution is to suck it up and make an intermediate table. It also shortens your formulas significantly.
I don't mark this as a solution because it's not what I was looking for. Still open to better suggestions that work only with formulas.
File: https://www.dropbox.com/sh/4ofctha6qhfgtqw/AAD0aPJXr__tononRTpKc1oka?dl=0
Maybe the following can help.
First, you filter out the entries which don't meet the criteria regarding the distribution.
In a second step, you sort the table from A to Z based on the column siteid.
Then you add an extra column after the last one with the formula =C3<>C4, where column C contains the siteid entries. That way, all duplicates are denoted by a FALSE value in the helper column.
After that, you filter out the FALSE values in this column, and you are left with unique site ids.
If I've misunderstood your question, I'd be glad to see an update so I can try to help further.
There is a star schema that contains 3 dimensions (Distributor, Brand, SaleDate) and a fact table with two fact columns: SalesAmountB, measured in boxes as an integer, and SalesAmountH, measured in hectolitres as a numeric. The end user wants to select which fact to show in a report. The report is going to be presented via SharePoint 2010 PPS.
So please help me determine which variant suits me best:
1) Add a new dimension like "Units" with two values, Boxes and Hectolitres, and use the built-in filter for this dimension (the fact data types are incompatible, though).
2) Make two separate tables for the two facts and build two cubes, then select either as the data source.
3) Leave the model as it is and use the PPS API in SharePoint to select the fact to show.
So any ideas?
I think the best way to implement this is to use separate fields for SalesAmountB and SalesAmountH in the fact table, then create 2 separate measures in BIDS and control the visibility through MDX. By doing this you avoid the complexity of duplicating the whole data set, or even creating separate cubes.
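As a sketch of what the MDX side can look like (cube, dimension and measure names are assumed, not taken from the question), the two physical measures simply coexist and the report's query asks for whichever one the user selected:

    -- Hypothetical query issued when the user picks "Hectolitres";
    -- swap in [Measures].[SalesAmountB] when "Boxes" is selected.
    SELECT
      { [Measures].[SalesAmountH] } ON COLUMNS,
      [Distributor].[Distributor].MEMBERS ON ROWS
    FROM [Sales];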
I am reading about the "entity-attribute-value model", which sort of reminds me of a star schema as used in data warehousing.
One table has all the facts (even if you mix apples and bananas, e.g. date of farming, weight, price, color, type, name) and a bunch of tables holding the details (e.g. infected_with_banana_virus_type, apple_specific_acid_level).
Both approaches seem to do this, so I can't see the difference between these two terms.
Please enlighten me. Cheers.
In all approaches you have entities, attributes and values. Everything reduces to this logically. Since everything has entities, attributes and values, you can always claim that everything is the same. All data structures are -- from that point of view -- identical.
Please draw a diagram of a star schema. With a fact (say web site GET requests) and some dimensions like Time, IP Address, Requested Resource Path, and session User.
Actually draw the actual diagram, please. Don't read the words, look at the picture of five tables.
After drawing that picture, draw a single EAV table.
Actually draw the picture with entity, attribute and value columns. Don't read the words. Look at the picture of one table.
Okay?
Now write down all the differences between the two pictures. Number of tables. Number of columns. Data types of each column. All the differences.
We're not done.
Write a SQL query to count GET requests by day of the week for a given user using the star schema. Actually write the SQL. It's a three-table join, with a GROUP BY and a WHERE.
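If you follow along, the star-schema version comes out looking something like this (the table and column names are invented for the exercise):

    SELECT t.day_of_week,
           COUNT(*) AS get_requests
    FROM request_fact AS f
    JOIN time_dim AS t ON t.time_id = f.time_id
    JOIN user_dim AS u ON u.user_id = f.user_id
    WHERE f.method = 'GET'
      AND u.username = 'alice'
    GROUP BY t.day_of_week;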
Try to write a SQL query to count GET requests by day of the week against the EAV table.
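For contrast, a sketch of the same question against a single EAV table (names invented again): every attribute you touch becomes another self-join, and the date logic has to run on values stored as strings:

    SELECT DATENAME(weekday, CAST(ts.[value] AS datetime)) AS day_of_week,
           COUNT(*) AS get_requests
    FROM eav AS m
    JOIN eav AS ts  ON ts.entity_id  = m.entity_id AND ts.attribute  = 'timestamp'
    JOIN eav AS usr ON usr.entity_id = m.entity_id AND usr.attribute = 'user'
    WHERE m.attribute = 'method' AND m.[value] = 'GET'
      AND usr.[value] = 'alice'
    GROUP BY DATENAME(weekday, CAST(ts.[value] AS datetime));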
Okay?
Now write down all the differences between the two queries. Complexity of the SQL, for example. Performance of the SQL. Time required to write the SQL.
Now you know the differences.