How to determine cubes which use particular dimension? - sql-server

Performing changes to dimensions is always a risk for me if I don't know exactly which cubes will be affected. Is there a more elegant way to do this than checking every cube one by one or creating an external documentation?
I would generally like to know if there is a way to do it, because we use a wide range of versions, but especially for SQL Server 2000 and 2008.

Dimensions are actually realted to Measure Groups (whihc belong to cubes, of course)
You can check these realtions with this query:
SELECT *
FROM $SYSTEM.MDSCHEMA_MEASUREGROUP_DIMENSIONS
WHERE CUBE_NAME = 'YOUR_CUBE_NAME'

Related

Choosing the right DBM-like C++ library for sequential data

I am trying to choose a database for a newly developing application. There are so many alternatives and it’s so easy to choose a wrong one. First of all, there is a requirement to not use database servers. A required database should be a static or dynamic C++ library. The data that needs to be stored is an array of records. They vary but are fixed for a given dataset (so they can be stored in a table). The information in each row could be from several hundred bytes up to several megabytes. And a number of rows may be millions for now and expected to grow.
The index of the row could be used as a key. No need to maintain a separate key column.
Data is inserted sequentially. Read access will be performed only by iterating all the data or some segment of it sequentially (May need to iterate with steps like each 5th).
I don’t think that relational DBs are good feet for many reasons.
a. They are mostly server-based. I know about SQLite but as far as I know, it stores data in one file which I assume may lead to issues related to maximum file size.
b. We don’t need the power that SQL provides instead we would like to have more flexibility in stored data types.
There are Key/Value non-SQL dbms like BerkeleyDB, RocksDB, or something like luxio for lighter alternatives. The functionality they provide is more than enough for the task. And this might be the right choice however I don’t know how well they are optimized for such case where we have continuous integer keys. The associative key access (which is not required for us) may have some overhead in performance.
I know there are some type of non-SQL databases called “wide-column” which I am not familiar with. However, the name sounds like it is perfect for our task. All databases I can find are server of claud based. If you know dbm-like library for such type of database please advise.
I am not experienced in databases so please correct me if I am wrong in any of 3 above stamens.
If your row data can grow to megabytes, and you're talking about only millions of records, maybe just figure out a way to lay it out in a filesystem? If you need a more database-like index, use SQLite for the keys, and have the data records point to a location on the filesystem. This kind of thing will be far quicker to implement and get right than trying to do it all in one giant database.

Compare two datasets from different databases using VB.NET

For data assurance, my task is two compare two datasets from different databases. Currently I am performing the cell-by-cell value comparison which is a Brute Force method and is consuming lot of time.
I would like to know if there are any methods which would save my time and memory space, which is able to provide a result indication "Tables are identical" or "Tables are not identical".
Thank you for your assistance.
How about creating a checksum for each table and compare the them?
Something like:
SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM TableX
This might need a ORDER BY to be more precise.
If they are from different sources, there is no other way than comparing them cell by cell AFAIK. However I can suggest you something that will probably increasing comparison speed by many folds. If your DataTables have identical structures, which they hopefully should since you're already comparing them cell by cell, try comparing ItemArray of each pair of rows instead of accessing them by column index or column names (or row properties if you're using strongly-typed DataSets). This will hopefully give you much better results.
If you're using .NET 3.5 or above, this line should do it:
Enumerable.SequenceEqual(row1.ItemArray, row2.ItemArray);

To use or not to use computed columns for performance and maintainability

I have a table where am storing a startingDate in a DateTime column.
Once i have the startingDate value, am supposed to calculate the
number_of_days,
number_of_weeks
number_of_months and
number_of_years
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application and you do care much about the applications response time, would you rather make the calculations in a view or create computed columns for each so you can query the table directly?
Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware the values are calculated when requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views also don’t exist until they are requested (unless they are materialised) so, again, there will be an overhead at runtime, but, again it’s on the database server, not the application server.
Like nearly everything: It depends.
As #RedX suggest it probably not much of a performance difference either way, so it becomes a question of how will use them. To me this is more of a feel thing.
Using them more than once doesn't wouldn't necessary drive me immediately to either a view or computed columns. If I only use them in a few places or low volume code paths I might calc them in-line in those places or use a CTE. But if the are in wide spread or heavy use I would agree with a view or computed column.
You would also want them in a view or cc if you want them available via ORM tools.
Am I using those "computed columns" individual in places or am I using them in sets? If using them in sets I probably want a view of the table that shows included them all.
When i need them do I usually want them associated with data from a particular other table? If so that would suggest a view.
Am I basing updates on the original table of those computed values? If so then I want computed columns to avoid joining the view in these case.
Calculated columns may seem an easy solution at first, but I have seen companies have trouble with them because when they try to do ETL with CDC for real-time Change Data Capture with tools like Attunity it will not recognize the calculated columns since the values are not there permanently. So there are some issues. Also if the columns will be retrieve many, many times by users, you will save time in the long run by putting that logic in the ETL tool or procedure and write it once to the database instead of calculating it many times for each request.

DISTINCT COUNT basket analysis between multiple measures with shared dimension

Let's say I have a cube with two different distinct count measures, call them Measure1 and Measure2. Both of these measures contain a common dimension, Dimension1, which is counted by both measures.
What I need to do is return a distinct count of Dimension1 members that exist in both Measure1 and Measure2, after appropriate filtering on each measure as required.
I can define MDX queries for both Measure1 and Measure2 individually and get distinct counts, but I need to be able to "overlap" the result to avoid double-counting the members that exist in both sets.
Note: in the actual scenario, there are more than 2 measures involved, and all MDX queries will be dynamically constructed (the user defines which measures and dimension criteria are included).
Can this be done in SSAS/MDX? If not, is there another Microsoft tool/feature that can? The minimum requirement for the system is SQL Server 2008 R2 Standard Edition.
Honestly I have no idea where to start. Google turned up nothing like this (I saw some basket analysis stuff involving a single measure, but I'm unsure if or how to apply that to my scenario). I'm not an SSAS/MDX/BI expert by any means.
There are two alternatives that I can think of:
Use DRILLTHROUGH using the individual MDX queries and (essentially) COUNT DISTINCT the results.
Use T-SQL on the data warehouse source database. (May be difficult to account for all scenarios efficiently.)
We do have a requirement to also be able to drillthrough, so I'll probably have to implement solution #1 anyway, but it would be nice to have a more efficient way to obtain just the counts, as counts will be needed far more frequently.
I would add a Distinct Count measure based on the Dimension1 Key attribute. I would present it in Excel 2010+ using the Sets MDX feature to filter on Measure1, 2 etc.
I never did find an MDX solution for this.
I went ahead with a solution that queries the data warehouse directly, and it's working pretty well so far after some performance tweaks. This approach may not be suitable for all applications, but it looks like it will work for our particular scenario.
I would recomend union function either on a SQL Server side (create view froom two tables) or at a SSAS side (create a single measure but with diferent partitions from diferent sources (for example, Partition1 - for Credits, Partition2 - for Deposits).
For second way, initialy over this "monstrous" decision you need make
simple Measure1 using SUM function. And after, ckick on Measure1 and choose "Create new measure" using DistionctCount function.
So SSAS would make a separate new Measure Group with Measure2 using DistionctCount function.
It must work perfectly.
Lets simplify the problem statement. You want the count of customers who bought both bread and eggs or who have a toyota and honda. I faced this issue a long time back and came up with a query design. The performance for these queries was not good. By the nature of these queries they are opening the fact to grain level. Hence all aggreagtion benifits lost.
Here is the code, I am counting the customers based on their names, who ordered
ClassicVestS or HLMountainTire and other products
with
member [Measures].[CustomersWhoBoughtClassicVestS] as
count(
intersect(
{nonempty(
existing ([Customer].[Customer].children),[Measures].[Internet Order Count]
)},
{extract(nonempty( ([Customer].[Customer].children* [Product].[Product].&[471]),[Measures].[Internet Order Count]),[Customer].[Customer])}
)
)
member [Measures].[CustomersWhoBoughtHLMountainTire] as
count(
intersect(
{nonempty(
existing ([Customer].[Customer].children),[Measures].[Internet Order Count]
)},
{extract(nonempty( ([Customer].[Customer].children* [Product].[Product].&[537]),[Measures].[Internet Order Count]),[Customer].[Customer])}
)
)
Select {[Measures].[CustomersWhoBoughtClassicVestS],[Measures].[CustomersWhoBoughtHLMountainTire]
} on columns ,
{ nonempty( [Product].[Product].children
,[Measures].[Internet Order Count]) }
on rows
from [Adventure Works]

Clustering Lat/Longs in a Database

I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application.
There are a number of resources about how to cluster, either on the client side OR in the server (application) side .. but not in the database side :(
This is a similar question, asked by a fellow S.O. member. The solutions are server side based (ie. C# code behind).
Has anyone had any luck or experience with solving this, but in a database? Are there any database guru's out there who are after a hawt and sexy DB challenge?
please help :)
EDIT 1: Clarification - by clustering, i'm hoping to group x number of points into a single point, for an area. So, if i say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are GROUP'D into a single result (say ... the middle of the square).
EDIT 2: I'm using MS Sql 2008, but i'm open to hearing if there are other solutions in other DB's.
I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement & converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements, and each cluster will have the same number of associated points (mod k).
I'd make a table of cluster centroids, and add a field to the original data table to indicate what cluster it belonged too. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.
*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).
If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store the "cluster ID" in the database along with the lat/long co-ordinates.
What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters) and each co-ordinate gets assigned to one of those clusters.
Then, you can detect very close coordinates by selecting those in the same square and moderately close ones by selecting those in adjacent squares.
The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get some clustering of co-ordinates that are a long way apart.
You will always have the edge cases such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another) but you could adjust the cluster size OR post-process the results on the client side.
I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:
def compute_chunk(latitude, longitude)
(floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end
def floor_lon(longitude)
((longitude + 180) * 10).to_i
end
def floor_lat(latitude)
((latitude + 90) * 10).to_i
end
Everything got really easy from there. I had some code for grabbing all of the chunks from a given point to a given radius that would translate into a single memcache multiget (and some code to backfill that when it was missing).
For movielandmarks.com I used the clustering code from Mike Purvis, one of the authors of Beginning Google Maps Applications with PHP and AJAX. It builds trees of clusters/points for different zoom levels using PHP and MySQL, storing it in the database so that recall is very fast. Some of it may be useful to you even if you are using a different database.
Why not testing multiple approaches?
translate the weka library in .NET CLI with IKVM.NET
add an assembly resulted from your code and weka.dll (use ilmerge) into your database
Make some tests, that is. No specific clustering works better than anyone else.
I believe you can use MSSQL's spatial data types. If they are similar to other spatial data types I know, they will store your points in a tree of rectangles, and then you can go to the lower-resolution rectangles to get implicit clusters.
If you end up wanting to explore Geohash's (which were invented at exactly the same time you posted this question), here's a more fleshed-out implementation of Geohash related functions for SQL Server's TSQL in which you might be interested.
QalGeohash-TSQL
I have used the Integer version of the Geohash extensively to cluster results to reduce data sent to a client for a limited viewport.

Resources