I'm new to MDX and trying to solve the following problem. I've investigated calculated members, subselects, SCOPE statements, etc., but can't quite get them to do what I want.
Let's say I'm trying to come up with the MDX equivalent to the following SQL query:
SELECT SUM(netMarketValue) net,
SUM(CASE WHEN netMarketValue > 0 THEN netMarketValue ELSE 0 END) assets,
SUM(CASE WHEN netMarketValue < 0 THEN netMarketValue ELSE 0 END) liabilities,
SUM(ABS(netMarketValue)) gross,
someEntity1
FROM (
SELECT SUM(marketValue) netMarketValue, someEntity1, someEntity2
FROM <some set of tables>
GROUP BY someEntity1, someEntity2) t
GROUP BY someEntity1
In other words, I have an account ledger where I hide internal offsetting transactions (within someEntity2), then calculate assets & liabilities after aggregating them by someEntity2. Then I want to see the grand total of those assets & liabilities aggregated by the bigger entity, someEntity1.
In my MDX schema I'd presumably have a cube with dimensions for someEntity1 & someEntity2, and marketValue would be my fact table/measure. I suppose I could create another DSV that did what my subquery does (calculating net), and simply create a cube with that as my measure group, but I wonder if there is a better way. I'd rather not have 2 cubes (one for these net calculations and another at a lower level of granularity for other use cases), since that would mean a lot of duplicate info in my database. These will be very large cubes.
I think you should leave the aggregation logic to the cube; that's what it does best.
In your case, I would create an account dimension and then add Account Intelligence. However, this only works with the Enterprise edition of SQL Server (2005 and above).
If you happen to have the Standard edition, the canonical way to do this is to use unary operators.
That's the way we used to do it with SQL Server 2000, and here you have a great example.
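To make the unary-operator idea concrete, here is a minimal sketch of what the underlying account dimension table might look like (the table and column names are hypothetical, not from the question; in SSAS you would point the parent attribute's UnaryOperatorColumn property at the operator column). Whether liabilities get '+' or '-' depends on whether they are stored as negative or positive amounts.

-- Hypothetical parent-child account dimension with a unary operator column.
-- '+' adds a member into its parent, '-' subtracts it, '~' excludes it.
CREATE TABLE dbo.DimAccount (
    AccountKey       int         NOT NULL PRIMARY KEY,
    ParentAccountKey int         NULL,            -- self-reference for the hierarchy
    AccountName      varchar(50) NOT NULL,
    UnaryOperator    char(1)     NOT NULL
);

INSERT INTO dbo.DimAccount (AccountKey, ParentAccountKey, AccountName, UnaryOperator)
VALUES (1, NULL, 'Net',         '+'),   -- parent: net is built from its children
       (2, 1,    'Assets',      '+'),
       (3, 1,    'Liabilities', '+');   -- '+' assumes liabilities are stored as negative amounts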
I think what you want is not two cubes, but one cube with two fact tables (sometimes called a constellation schema). The question was written months ago so I won't elaborate more here unless someone asks for more info.
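As a rough sketch of what that might look like at the relational level (all names here are hypothetical, not from the question): both fact tables share the same conformed dimensions and become two measure groups in a single cube, one at transaction grain and one already netted per someEntity2.

-- Two fact tables sharing the same dimensions (a fact-constellation schema).
CREATE TABLE dbo.DimEntity1 (Entity1Key int PRIMARY KEY, Entity1Name varchar(50));
CREATE TABLE dbo.DimEntity2 (Entity2Key int PRIMARY KEY, Entity2Name varchar(50));

-- Low-granularity fact: one row per transaction.
CREATE TABLE dbo.FactMarketValue (
    Entity1Key  int   NOT NULL REFERENCES dbo.DimEntity1 (Entity1Key),
    Entity2Key  int   NOT NULL REFERENCES dbo.DimEntity2 (Entity2Key),
    MarketValue money NOT NULL
);

-- Pre-netted fact: one row per (Entity1, Entity2) with internal offsets removed.
CREATE TABLE dbo.FactNetMarketValue (
    Entity1Key     int   NOT NULL REFERENCES dbo.DimEntity1 (Entity1Key),
    Entity2Key     int   NOT NULL REFERENCES dbo.DimEntity2 (Entity2Key),
    NetMarketValue money NOT NULL
);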
I'm using Microsoft SQL Server and Excel. I'm having trouble getting my head round how to get some fields calculated so that the whole thing runs faster. I have a large data set that gets dropped into Excel and stuck into a pivot table.
At its simplest, the table will contain a number of fields similar to the below.
Date user WorkType Count TotalTime
My issue is that I need to calculate an average in a particular way. Each user may have several work types on any given day. The formula I have is, for each Date & User, Sum(TotalTime)/Sum(Count), which gets me the following:
Date user Average
Currently I dump a select query into Excel, apply a formula to a column to get my averages, then construct the pivot table using the personal details and the averages.
The calculation over more than 20,000 rows, however, takes about 5-7 minutes.
So my question is: is it possible to do that type of calculation in either SQL or the pivot table to cut down the processing time? I'm not very confident with pivot tables, and I'm fairly inexperienced at SQL compared to people here. I can manage bits of this, but pulling it all together with the condition of matching Date and User is beyond me right now.
I could parse the recordset into an array to do my calculations that way before it gets written to the spreadsheet, but I just feel that there should be a better way to achieve the same end.
Pre-calculated aggregates in SQL can go very wrong in an Excel Pivot.
Firstly, you can accidentally take an average of an average.
Secondly, once users start re-arranging the pivot you can get very strange sub-totals and totals.
Try to ensure you do all of your aggregation in one place.
If possible, use SQL with SSRS; you can base a report on a parameterised stored procedure. That way you push all of the hard work onto the SQL box, and you restrict users from pivoting things around improperly.
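As a minimal sketch of such a procedure, using the column names from the question (the table name dbo.WorkLog and the parameter names are assumptions; this also assumes TotalTime is stored as a numeric duration such as seconds, otherwise you need the DATEDIFF/DATEADD conversion shown in the query that follows):

-- Hypothetical parameterised procedure: all aggregation happens in SQL,
-- so Excel/SSRS only has to display one pre-computed average per user per day.
CREATE PROCEDURE dbo.GetDailyAverages
    @StartDate date,
    @EndDate   date
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [Date],
           [user],
           SUM(CAST(TotalTime AS decimal(18, 2))) / NULLIF(SUM([Count]), 0) AS [Average]
    FROM   dbo.WorkLog                 -- assumed table name
    WHERE  [Date] BETWEEN @StartDate AND @EndDate
    GROUP BY [Date], [user];
END;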
For anyone interested, here are the results of my searching. Thank you for the help that pointed me in the right direction. It seems quite clumsy with the conversions due to a time datatype, but it's working:
SELECT Raw_Data.ID, Raw_Data.fldDate, CONVERT(datetime, DATEADD(second, SUM(DATEDIFF(second, 0, Raw_Data.TotalHandle)) / SUM(Raw_Data.[Call Count]), 0), 108) AS avgHandled
FROM Raw_Data
WHERE (Raw_Data.fldDate BETWEEN '05-12-2016' AND '07-12-2016')
GROUP BY Raw_Data.fldDate, Raw_Data.ID
I have SSAS and SSRS 2008 R2. The end goal is to get a report with the daily MarketValue for each Portfolio and Security combination. MarketValue has a SCOPE calculation to select the last existing value for the Date dimension. If the SCOPE is removed the query still takes 6 minutes to complete, and with the SCOPE statement it times out after 1 hour. Here is my query:
SELECT
  NON EMPTY
    {[Measures].[MarketValue]} ON COLUMNS
 ,NON EMPTY
    {
      [Portfolio].[PortfolioName].[PortfolioName].ALLMEMBERS *
      [Effective Date].[Effective Date].[Effective Date].ALLMEMBERS *
      [Security].[Symbol].[Symbol].ALLMEMBERS
    }
    DIMENSION PROPERTIES
      MEMBER_CAPTION
     ,MEMBER_UNIQUE_NAME
    ON ROWS
FROM EzeDM
WHERE
  (
    [AsOn Date].[AsOn Date].&[2014-06-17T06:32:41.97]
   ,[GoldenCopy].[Frequency].&[Daily]
   ,[GoldenCopy].[GoldenCopyType].&[CitcoPLExposure]
   ,[GoldenCopy].[PointInTime].&[EOP]
   ,[GoldenCopy].[PositionType].&[Trade-Date]
  );
The SCOPE statement I have for the MarketValue measure is:
SCOPE [Effective Date].[Effective Date].MEMBERS;
    THIS = Tail
           (
             (EXISTING [Effective Date].[Effective Date].MEMBERS)
            ,1
           ).Item(0);
END SCOPE;
The Security dimension has around 4K members, the Portfolio dimension around 100, and the EffectiveDate dimension around 400.
If I remove EffectiveDate from the cross join, the query takes less than 2 seconds.
So far I have tried different combinations and found that the slowness is due to the cross join between dimensions with many members. But then, is 4,000 members in a dimension actually large? People must have done the same reporting efficiently, right?
Is this due to the SCOPE calculation? If so, why does it only get slower when EffectiveDate is in the cross join?
Appreciate any help.
EDIT:1
Adding some more details about the current environment in case that helps:
We do not have the Enterprise edition, and we currently have no plans to ask our clients to upgrade to it.
The Security dimension has around 40 attributes, but 2 of them will always have data and at most 6 may have any data. I am not sure whether an attribute that is not used in the MDX query still affects query performance, regardless of whether it has data or not.
After reading Chris Webb's blog on MDX query improvements, I noticed that AttributeHierarchyEnabled = True for all attributes in all dimensions.
For testing I have set it to False on everything except the attributes I am currently using.
I do not have any aggregations defined on the cube, so I started building aggregations with the "Design Aggregations" wizard. After that I profiled the same reporting query and did not see any "Get Data From Aggregation" events.
So currently I am working on preparing/testing usage-based aggregations.
EDIT:2
So I created the query log table with 50% logging sampling, ran 15-20 different reporting queries the client is expected to run, and saw some data in the log table. I used the wizard for usage-based aggregation and let SSAS work out the estimated row counts.
Strangely, it did not generate any aggregations.
I also tried changing the AggregateFunction property to LastChild as Frank suggested, and it worked great, but then I realized I cannot pick the LastChild value of MarketValue across every dimension: it is additive across the Security dimension but not across time.
I would assume that getting rid of the whole SCOPE statement and instead setting the AggregateFunction property of the measures to LastChild or LastNonEmpty would speed up the calculation. This would require [Effective Date] to be the first dimension tagged as time, and you need SQL Server Enterprise edition for these AggregateFunctions to be available.
I am currently dealing with a scenario whereby I have 2 SQL databases and I need to compare the changes in the data in each of the tables in the 2 databases. The databases have exactly the same tables but may have differing data in each of them. The first hurdle is how to compare the data sets, and the second challenge is, once I have identified the differences, how to (for example) push data into database 2 that exists in database 1 but is missing from database 2.
Why not use SQL Data Compare instead of re-inventing the wheel? It does exactly what you're asking - compares two databases and writes scripts that will sync in either direction. I work for a vendor that competes with some of their tools and it's still the one I would recommend.
http://www.red-gate.com/products/sql-development/sql-data-compare/
One powerful command for comparing data is EXCEPT. With this, you can compare two tables with same structure simply by doing the following:
SELECT * FROM Database1.dbo.Table
EXCEPT
SELECT * FROM Database2.dbo.Table
This will give you all the rows that exist in Database1 but not in Database2, including rows that exist in both but are different (because it compares every column). Then you can reverse the order of the queries to check the other direction.
Once you have identified the differences, you can use INSERT or UPDATE to transfer the changes from one side to the other. For example, assuming you have a primary key field PK, and new rows only come into Database2, you might do something like:
INSERT INTO Database1.dbo.Table
SELECT T2.*
FROM Database2.dbo.Table T2
LEFT JOIN Database1.dbo.Table T1 on T2.PK = T1.PK
WHERE T1.PK IS NULL -- insert rows that didn't match, i.e., are new
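For the UPDATE side mentioned above, a similar sketch under the same assumptions (a primary key PK and the same placeholder table names; SomeColumn is a hypothetical column you want to keep in sync):
UPDATE T1
SET    T1.SomeColumn = T2.SomeColumn
FROM   Database1.dbo.Table T1
JOIN   Database2.dbo.Table T2 ON T1.PK = T2.PK
WHERE  T1.SomeColumn <> T2.SomeColumn  -- only touch rows that actually differ (simple comparison; ignores NULL-to-value changes)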
The actual methods used depend on how the two tables are related together, how you can identify matching rows, and what the sources of changes might be.
You can also look at the Data Compare feature in Visual Studio 2010 (Premium and higher). I use it to make sure configuration tables in all my environments (i.e. development, test and production) are in sync. It has made my life enormously easier.
You can select the tables you want to compare and choose the columns to compare. What I haven't learned to do yet, though, is how to save my selections for future use.
You can do this with SQL Compare, which is a great tool for development, but if you want to do this on a scheduled basis a better solution might be Simego's Data Sync Studio. I know it can do about a 100m-row (30 cols wide) compare with 16GB of RAM on an i3 iMac (Boot Camp). In reality it is comfortable with 1m-20m rows on each side. It uses a column storage engine.
It would only take a couple of minutes to download, install and test it against this scenario.
I hope this helps as I always look for the question mark to work out what someone is asking.
I would like to know how comparisons for the IN clause work in a database. In this case, I am interested in SQL Server and Oracle.
I thought of two comparison models: binary search and hashing. Can someone tell me which method SQL Server follows?
SQL Server's IN clause is basically shorthand for a wordier WHERE clause.
...WHERE column IN (1,2,3,4)
is shorthand for
...WHERE Column = 1
OR Column = 2
OR column = 3
OR column = 4
AFAIK there is no other logic applied that would be different from a standard WHERE clause.
It depends on the query plan the optimizer chooses.
If there is a unique index on the column you're comparing against and you are providing relatively few values in the IN list in comparison to the number of rows in the table, it's likely that the optimizer would choose to probe the index to find out the handful of rows in the table that needed to be examined. If, on the other hand, the IN clause is a query that returns a relatively large number of rows in comparison to the number of rows in the table, it is likely that the optimizer would choose to do some sort of join using one of the many join methods the database engine understands. If the IN list is relatively non-selective (i.e. something like GENDER IN ('Male','Female')), the optimizer may choose to do a simple string comparison for each row as a final processing step.
And, of course, different versions of each database with different statistics may choose different query plans that result in different algorithms to evaluate the same IN list.
IN is the same as EXISTS in SQL Server usually. They will give a similar plan.
Saying that, IN is shorthand for OR..OR as JNK mentioned.
For more than you possibly ever needed to know, see Quassnoi's blog entry
FYI: the OR shorthand leads to another important difference: NOT IN is very different from NOT EXISTS/OUTER JOIN, because NOT IN fails when there are NULLs in the list.
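To illustrate the NULL pitfall with a small, hedged example (the Customers/Orders tables are hypothetical, not from the question):
-- If the subquery returns any NULL, "x NOT IN (...)" can never evaluate to TRUE,
-- because "x <> NULL" is UNKNOWN, so the predicate filters out every row.
SELECT c.CustomerID
FROM   dbo.Customers c
WHERE  c.CustomerID NOT IN (SELECT o.CustomerID FROM dbo.Orders o);  -- returns 0 rows if any Orders.CustomerID is NULL

-- NOT EXISTS is unaffected by NULLs and returns the expected rows.
SELECT c.CustomerID
FROM   dbo.Customers c
WHERE  NOT EXISTS (SELECT 1 FROM dbo.Orders o WHERE o.CustomerID = c.CustomerID);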
Have you ever seen any of these error messages?
-- SQL Server 2000
Could not allocate ancillary table for view or function resolution.
The maximum number of tables in a query (256) was exceeded.
-- SQL Server 2005
Too many table names in the query. The maximum allowable is 256.
If yes, what have you done?
Given up? Convinced the customer to simplify their demands? Denormalized the database?
#(everyone wanting me to post the query):
I'm not sure if I can paste 70 kilobytes of code in the answer editing window.
Even if I could, it wouldn't help, since those 70 kilobytes of code reference 20 or 30 views that I would also have to post, because otherwise the code would be meaningless.
I don't want to sound like I am boasting here but the problem is not in the queries. The queries are optimal (or at least almost optimal). I have spent countless hours optimizing them, looking for every single column and every single table that can be removed. Imagine a report that has 200 or 300 columns that has to be filled with a single SELECT statement (because that's how it was designed a few years ago when it was still a small report).
For SQL Server 2005, I'd recommend using table variables and partially building the data as you go.
To do this, create a table variable that represents your final result set you want to send to the user.
Then find your primary table (say the orders table in your example above) and pull that data, plus a bit of supplementary data that is only say one join away (customer name, product name). You can do an INSERT ... SELECT to put this straight into your table variable.
From there, iterate through the table and for each row, do a bunch of small SELECT queries that retrieves all the supplemental data you need for your result set. Insert these into each column as you go.
Once complete, you can then do a simple SELECT * from your table variable and return this result set to the user.
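A rough sketch of that pattern (the table, column, and key names are purely illustrative, not from the original question; the follow-up queries are shown here as set-based UPDATEs rather than a literal row-by-row loop):
-- 1. Declare a table variable shaped like the final result set.
DECLARE @Result TABLE (
    OrderID      int PRIMARY KEY,
    CustomerName varchar(100),
    ProductName  varchar(100),
    ShipperName  varchar(100)   -- ...plus whatever other report columns you need
);

-- 2. Seed it from the primary table plus one or two close joins.
INSERT INTO @Result (OrderID, CustomerName, ProductName)
SELECT o.OrderID, c.CustomerName, p.ProductName
FROM   dbo.Orders o
JOIN   dbo.Customers c ON c.CustomerID = o.CustomerID
JOIN   dbo.Products  p ON p.ProductID  = o.ProductID;

-- 3. Fill in the remaining columns with small follow-up queries
--    instead of piling more joins onto one massive statement.
UPDATE r
SET    r.ShipperName = s.ShipperName
FROM   @Result r
JOIN   dbo.Shipments sh ON sh.OrderID  = r.OrderID
JOIN   dbo.Shippers  s  ON s.ShipperID = sh.ShipperID;

-- 4. Return the assembled result set.
SELECT * FROM @Result;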
I don't have any hard numbers for this, but there have been three distinct instances that I have worked on to date where doing these smaller queries has actually worked faster than doing one massive select query with a bunch of joins.
#chopeen You could change the way you're calculating these statistics, and instead keep a separate table of all per-product stats.. when an order is placed, loop through the products and update the appropriate records in the stats table. This would shift a lot of the calculation load to the checkout page rather than running everything in one huge query when running a report. Of course there are some stats that aren't going to work as well this way, e.g. tracking customers' next purchases after purchasing a particular product.
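As a hedged sketch of that idea (all table, column, and procedure names here are made up for illustration): the stats table is maintained incrementally at checkout instead of being recomputed in one huge reporting query.
-- Hypothetical per-product stats table.
CREATE TABLE dbo.ProductStats (
    ProductID    int   NOT NULL PRIMARY KEY,
    UnitsSold    int   NOT NULL DEFAULT 0,
    TotalRevenue money NOT NULL DEFAULT 0
);

-- Called once per order line when an order is placed.
CREATE PROCEDURE dbo.AddSaleToStats
    @ProductID int,
    @Quantity  int,
    @LineTotal money
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE dbo.ProductStats
    SET    UnitsSold    = UnitsSold + @Quantity,
           TotalRevenue = TotalRevenue + @LineTotal
    WHERE  ProductID = @ProductID;

    -- First sale of this product: create the row instead.
    IF @@ROWCOUNT = 0
        INSERT INTO dbo.ProductStats (ProductID, UnitsSold, TotalRevenue)
        VALUES (@ProductID, @Quantity, @LineTotal);
END;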
This would happen all the time when writing Reporting Services Reports for Dynamics CRM installations running on SQL Server 2000. CRM has a nicely normalised data schema which results in a lot of joins. There's actually a hotfix around that will up the limit from 256 to a whopping 260: http://support.microsoft.com/kb/818406 (we always thought this a great joke on the part of the SQL Server team).
The solution, as Dillie-O alludes to, is to identify appropriate "sub-joins" (preferably ones that are used multiple times) and factor them out into temp-table variables that you then use in your main joins. It's a major PIA and often kills performance. I'm sorry for you.
#Kevin, love that tee -- says it all :-).
I have never come across this kind of situation, and to be honest the idea of referencing > 256 tables in a query fills me with a mortal dread.
Your first question should probably be "Why so many?", closely followed by "What bits of information do I NOT need?" I'd be worried that the amount of data being returned from such a query would begin to impact the performance of the application quite severely, too.
I'd like to see that query, but I imagine it's some problem with some sort of iterator, and while I can't think of any situation where it's possible, I bet it's from a bad while/case/cursor or a ton of poorly implemented views.
Post the query :D
Also, I feel like one of the possible problems could be having a ton (read 200+) of name/value tables which could be condensed into a single lookup table.
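For what it's worth, a sketch of that consolidation (hypothetical names; the idea is one generic lookup table keyed by category instead of hundreds of near-identical name/value tables):
-- One generic lookup table replaces many tiny name/value tables.
CREATE TABLE dbo.Lookup (
    LookupCategory varchar(50)  NOT NULL,  -- e.g. 'OrderStatus', 'Country', 'Currency'
    LookupCode     varchar(50)  NOT NULL,
    LookupValue    varchar(200) NOT NULL,
    PRIMARY KEY (LookupCategory, LookupCode)
);
-- A former lookup table's rows become one category:
-- INSERT INTO dbo.Lookup (LookupCategory, LookupCode, LookupValue)
-- SELECT 'OrderStatus', StatusCode, StatusName FROM dbo.OrderStatus;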
I had this same problem... my development box runs SQL Server 2008 (where the view worked fine), but on production (with SQL Server 2005) it didn't. I ended up creating intermediate views to avoid this limitation, using the new views as part of the query in the view that threw the error.
Kind of silly considering the logical execution is the same...
Had the same issue in SQL Server 2005 (worked in 2008) when I wanted to create a view. I resolved the issue by creating a stored procedure instead of a view.