Benchmarks comparing SSAS with SQL summary tables

Can anyone point me at a performance benchmark comparing SSAS with querying your own rollup tables in SQL?
What I'd like to understand is whether the benefit of SSAS is entirely maintenance/convenience (managing your own rollup tables may become unmaintainable with a large number of dimensions), or whether there is some magic in the MOLAP storage itself that makes it faster than equivalent relational SQL queries against equivalent pre-built aggregates.

It's as much about the ease of slicing and dicing as it is about storing the aggregations. Sure, you can create your own rollup tables and the subsequent queries that handle 10 different dimensions, but I would rather use Excel/SSMS and just drag and drop. I can also point my users at the cube and say 'have fun'. I don't need to facilitate their every need; they can self-serve.
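To give a feel for what the hand-rolled alternative involves, here is a minimal sketch of one summary table and its refresh query, assuming a hypothetical star schema (FactSales, DimDate, DimRegion are illustrative names only):

```sql
-- Hypothetical rollup table, maintained by your own ETL instead of SSAS-managed aggregations.
CREATE TABLE dbo.SalesRollup_MonthRegion (
    CalendarMonth date          NOT NULL,
    Region        nvarchar(50)  NOT NULL,
    SalesAmount   decimal(18,2) NOT NULL,
    OrderCount    int           NOT NULL,
    CONSTRAINT PK_SalesRollup_MonthRegion PRIMARY KEY (CalendarMonth, Region)
);

-- Periodic refresh: one aggregation query per dimension combination you want pre-built.
INSERT INTO dbo.SalesRollup_MonthRegion (CalendarMonth, Region, SalesAmount, OrderCount)
SELECT d.FirstDayOfMonth, r.RegionName, SUM(f.SalesAmount), COUNT(*)
FROM dbo.FactSales AS f
JOIN dbo.DimDate   AS d ON d.DateKey   = f.DateKey
JOIN dbo.DimRegion AS r ON r.RegionKey = f.RegionKey
GROUP BY d.FirstDayOfMonth, r.RegionName;
```

Every extra dimension or hierarchy level you want to slice by means another table (or another grouping) plus the ETL to keep it fresh, which is exactly the maintenance burden in question.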
As for the benchmark, that is dependent on your data warehouse schema, indexes, calculations, etc. Basically, you would need to do the analysis yourself to see if it is better for your situation.

Sorry, I don't have a link to a benchmark. But the rollup tables become complicated very quickly as the type and number of dimensions increase.
This document is the performance guide for MOLAP; page 78 shows a diagram of how aggregations are built when you have hierarchies. As you can see there, the number of aggregations you need grows quickly once hierarchies are involved.
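To see why, consider a hypothetical fact table with just two small hierarchies (Year > Month and Country > City). Pre-aggregating every useful level combination already needs nine grouping sets, and each extra attribute or hierarchy level multiplies that number again:

```sql
-- 3 levels on the date hierarchy x 3 levels on the geography hierarchy = 9 grouping sets.
SELECT [Year], [Month], Country, City, SUM(SalesAmount) AS Sales
FROM dbo.FactSales            -- hypothetical fact table
GROUP BY GROUPING SETS (
    (),               ([Year]),                ([Year], [Month]),
    (Country),        ([Year], Country),       ([Year], [Month], Country),
    (Country, City),  ([Year], Country, City), ([Year], [Month], Country, City)
);
```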

Related

Will/can Solr, Elasticsearch and Kibana outperform SQL Server's cube technologies?

This trio of products came up as an alternative to SQL Server for searching and presenting analytics over a survey-based pattern of about 100 million data points. A survey pattern is basically questions × answers × forms × studies, and in our case it is very QA-oriented, covering how people did their jobs. About 7% of our data points cannot be quantified because they are comments.
So, can this community envision (perhaps provide a link to a success story of) leveraging these products for slicing and dicing metrics (via drag and drop) along with comments over 100 million data points, and outperforming SQL Server? Our metrics can be dollars, scores, counts or hours, depending on the question. We have at least two hierarchies, one over people and the other over departments. Both are temporal in that, depending on the date, they have different relationships (i.e. changing dimensions). In all there are about 90 dimensions for each data point, depending on how you count the hierarchy levels.
You can't compare a SQL engine and Elasticsearch/Solr directly.
It depends on how you want to query the data: joins or not, full-text search or not, etc.
Like Thomas said, it depends. It depends on your data and how you want to query it. In general, for text-oriented data NoSQL will be better and provide more functionality than SQL. However, if I understand correctly, only 7% of your data is text focused (the comments), so I assume the rest is structured.
In terms of performance, it depends what kind of text analysis you want to do and what kind of queries you want to recreate. For example, joining is usually much simpler and quicker in SQL than its non-relational equivalent. You could set up a basic Solr instance, recreate some of your text-related SQL queries as their Solr equivalents, and see how it performs on your data in comparison.
While NoSQL is usually touted as better at scaling, which of the two is better in a given situation depends heavily on your data and requirements.
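To make that comparison concrete, the kind of slicing query you would be recreating might look like the sketch below (the survey star schema and names are hypothetical); the test is then how Solr/Elasticsearch facets or aggregations handle the same grouping combined with a full-text filter over the comments.

```sql
-- Hypothetical survey schema: one FactAnswer row per data point.
SELECT dep.Department, q.QuestionText,
       SUM(a.Score) AS TotalScore,
       COUNT(*)     AS Answers
FROM dbo.FactAnswer    AS a
JOIN dbo.DimQuestion   AS q   ON q.QuestionKey = a.QuestionKey
JOIN dbo.DimDepartment AS dep ON dep.DeptKey   = a.DeptKey
WHERE a.CommentText LIKE N'%safety%'   -- this predicate is where full-text engines shine
  AND a.AnswerDate >= '2014-01-01'
GROUP BY dep.Department, q.QuestionText;
```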

Double index within a noSQL database

I am working on creating a database to store three things: let's say Experiment, Measure and metadata. The metadata is composed of a variable number and type of attributes, which makes NoSQL an attractive choice.
I need two simple queries over the database:
1) Give me the metadata of all the experiments with a given value of Measure.
2) Give me the metadata of all the measures for an Experiment.
And my main requirements are:
1) Tons of data. Each Experiment can come with millions of possible measures (and of course the metadata), and I expect tens of thousands of Experiments.
2) Concurrency. I would like to have fast concurrent read/write because at any given point in time I may be running 10-20 experiments, and they will want to write millions of measures at the same time.
I've tried MongoDB, but it is slow due to the write locks. I would like to have something faster. Additionally, it does not handle one of my queries well, as I basically need two indexes here. I am considering Titan as an alternative, just because it seems natural to think of experiments and measures as nodes and connect them with edges. Hypertable seems another possibility if I can find a way of doing both queries fast.
There are so many noSQL databases out there that I may be missing the right one for my needs. Suggestions?
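For reference, in relational terms the "double index" the question needs boils down to something like the following hypothetical sketch (SQL Server syntax), with one index per access path:

```sql
-- Hypothetical schema: Experiment metadata plus millions of Measure rows per experiment.
CREATE TABLE Experiment (
    ExperimentId bigint PRIMARY KEY,
    Metadata     nvarchar(max)                         -- variable attributes, e.g. stored as JSON
);

CREATE TABLE Measure (
    MeasureId    bigint PRIMARY KEY,
    ExperimentId bigint NOT NULL REFERENCES Experiment (ExperimentId),
    Value        float  NOT NULL,
    Metadata     nvarchar(max)
);

-- Query 2 ("all measures for an Experiment") wants this index:
CREATE INDEX IX_Measure_Experiment ON Measure (ExperimentId);
-- Query 1 ("all experiments with a given Measure value") wants this one:
CREATE INDEX IX_Measure_Value ON Measure (Value) INCLUDE (ExperimentId);
```

Whichever engine you pick, it has to support both of these access paths efficiently under heavy concurrent writes.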
Have you looked into NewSQL databases that could fit your needs? I suggest you take a closer look at Starcounter, which is truly ACID, takes no locks on writes, and supports indexes on basic properties as well as combined indexes.
I think a transactional database that is object oriented and memory centric would suit your demands. You can then have different Experiments and Measures that derive from the same class, and you can query each type as well as the inherited types separately.
If you do not have more than a TB of data, you do not need the big-data databases you have looked into so far. They are really good at what they do, but I think you should look into the other end of the NoSQL spectrum. With an in-memory, object-oriented database (all writes secured on persistent storage, of course) you get about 4x compression compared to relational databases, so a TB of capacity is often enough.
It is really hard to find your way around the jungle of databases today, so I understand the difficulty of finding something that fits your requirements. In your case, my five cents go to a transactional NoSQL database that is truly ACID and has SQL query support!

Which database is best to work with graphs and tree structured data?

I am planning to work with Dapper.NET for a family site.
A lot of tree-like data will be present in the structure.
Which database provides the best queries for working with cyclic/acyclic tree relations?
I want to compare the ease of use and performance of hierarchical queries,
i.e. things like CTEs in SQL Server, CONNECT BY/START WITH in Oracle, etc.
Is Dapper the best choice of micro-ORM for this kind of tree-structured data?
I need opinions on choosing the right database and the right micro-ORM for this.
Sorry for my bad English.
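As a point of reference for the CTE approach mentioned above, here is a minimal recursive CTE over a hypothetical adjacency-list table in SQL Server; Oracle's CONNECT BY / START WITH expresses the same traversal:

```sql
-- Hypothetical family table: each row stores a pointer to its parent.
CREATE TABLE dbo.Person (
    PersonId int PRIMARY KEY,
    ParentId int NULL REFERENCES dbo.Person (PersonId),
    Name     nvarchar(100) NOT NULL
);

-- All descendants of a given root person, with their depth in the tree.
DECLARE @RootPersonId int = 1;   -- hypothetical parameter

WITH Descendants AS (
    SELECT PersonId, ParentId, Name, 0 AS Depth
    FROM dbo.Person
    WHERE PersonId = @RootPersonId
    UNION ALL
    SELECT p.PersonId, p.ParentId, p.Name, d.Depth + 1
    FROM dbo.Person  AS p
    JOIN Descendants AS d ON p.ParentId = d.PersonId
)
SELECT PersonId, Name, Depth
FROM Descendants
ORDER BY Depth, Name;
```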
My question still stands: How much data do you expect?
But apart from that, it's not just the type of database you choose for your data; it's also the table structure. Hierarchy trees can be stored in various ways depending on your needs.
Table structure
Particular structures may be very fast on traversal reads but slow on inserts/updates (e.g. nested sets), while others (adjacency lists) work the other way around. For a 99:1 read:write ratio (the vast majority of today's applications read much more than they write) I would likely choose a modified nested set structure with left, right, depth and parent columns. This gives you the best options for read scenarios.
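A minimal sketch of that modified nested set layout and the kind of read it is good at (names are illustrative):

```sql
-- Left/Right interval columns alongside ParentId and Depth, as described above.
CREATE TABLE dbo.TreeNode (
    NodeId   int PRIMARY KEY,
    ParentId int NULL,
    Depth    int NOT NULL,
    Lft      int NOT NULL,
    Rgt      int NOT NULL
);

-- Reading an entire subtree is a single range predicate: every descendant's
-- (Lft, Rgt) interval is nested inside its ancestor's interval.
DECLARE @SubtreeRootId int = 1;   -- hypothetical parameter
SELECT child.NodeId, child.Depth
FROM dbo.TreeNode AS parent
JOIN dbo.TreeNode AS child
  ON child.Lft > parent.Lft AND child.Rgt < parent.Rgt
WHERE parent.NodeId = @SubtreeRootId;
```

The flip side is writes: inserting a node shifts the Lft/Rgt values of everything to its right, which is why adjacency lists tend to win for write-heavy trees.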
Database type
Unless you're aiming at huge amounts of data, I suggest you go with whichever SQL database you know best (MSSQL, MySQL, Oracle). But if your database will contain an enormous number of hierarchy nodes, then flirting with a specialised graph-oriented database may be a better option.
80 million nodes
If you opt for a modified nested set solution (also using negative values, so the number of updates on insert/update halves), you'd have a hierarchy table with Left, Right, ID and ParentID columns; at four integer columns, roughly 16 bytes per row, 80 million nodes comes to an approximately 1.2 GB table. But that's your top estimate after at least two years of usage.
My suggestion
Go quick & go light. Don't overengineer by using the best possible database to store your hierarchy if it turns out it's not needed after all. I'd therefore suggest you use a relational DB initially so you can get to market quickly, even though the solution will start to struggle after some millions of records. But before your database starts to struggle (we're talking years here) you'll gain two things:
You'll see whether your product takes off in the first place (there are many genealogy services already), so you won't have invested in learning a new technology, and using proven and supported technology gets you to market quickly;
If your product does succeed (and I genuinely hope it does), you'll still have enough time to learn a different storage solution and implement it; with proper code layers it shouldn't be too hard to switch storage later when required.

SSAS - should I use views or underlying tables?

I have a set of views set up in SQL Server which output exactly the results that I would like to include in a SQL Server Analysis Services cube, including the calculation of a number of dimensions (such as Age using DATEDIFF, business quarter using DATENAME etc.). What I would like to know is whether it makes sense to use these views as the data source for a cube, or whether I should use the underlying tables to reproduce the logic in SSAS. What are the implications of going either route?
My concerns are:
the datasets are massive, but we need quick access to the results, so I would like as many of the calculations done in the views as possible to be persisted within the SSAS data warehouse
Again, because the datasets are massive, I want the recalculation of any cubes to be as fast as possible
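For illustration, the kind of view described above might look like this minimal sketch (table and column names are hypothetical):

```sql
-- Hypothetical data source view for the cube: calculations live in SQL,
-- so the cube just consumes ready-made columns.
CREATE VIEW dbo.vw_FactSale
AS
SELECT
    f.SaleKey,
    f.CustomerKey,
    f.SaleAmount,
    DATEDIFF(YEAR, c.BirthDate, f.SaleDate) AS AgeAtSale,        -- "Age using DATEDIFF"
    'Q' + DATENAME(QUARTER, f.SaleDate)     AS BusinessQuarter   -- "business quarter using DATENAME"
FROM dbo.FactSale    AS f
JOIN dbo.DimCustomer AS c ON c.CustomerKey = f.CustomerKey;
```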
Many experts actually recommend using views in your data source view in SSAS. John Welch (Pragmatic Works, Microsoft MVP, Analysis Services Maestro) spoke about how he prefers using views in the DSV this year at SQL Rally Dallas. The reason is that it creates a layer between the cube and the physical tables.
Calculating columns in the view will take a little extra time and resources during cube processing. If processing time is ok, leave the computations in the view. If it's an issue, you can always add a persisted computed column directly to the fact table so that the calculation is done during the insert / update of the fact table. The disadvantage of this is that you'll have to physically store the columns in the fact table. The advantage is that they don't have to be computed every time the cube gets processed. These are the tradeoffs that you'll need to weigh to decide which way to go.
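A hedged sketch of that persisted computed column option, using hypothetical names (the expression has to be deterministic, which is why DATEPART is used here rather than DATENAME):

```sql
-- Computed when the fact row is inserted/updated and stored on disk,
-- so cube processing reads it instead of recalculating it every time.
ALTER TABLE dbo.FactSale
ADD SaleQuarter AS DATEPART(QUARTER, SaleDate) PERSISTED;
```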
Just make sure you tune the source queries to be as efficient as possible. Views are fine for DSVs.
Views, always! The only advantage of using tables in the DSV is that it will map your keys automatically :) which saves you 5 minutes of development time, haha.
Also, by "use the underlying tables to reproduce the logic in SSAS" do you mean creating calculated columns in your SSAS DSV? That is an option too, but I'd rather add the calculations to the views because, in case I have to update them, it is MUCH easier (and less subject to failure) to re-deploy a view than to redeploy a full cube.

How to improve ESRI/ArcGIS database performance while maintaining normalization?

I work with databases containing spatial data. Most of these databases are in a proprietary format created by ESRI for use with their ArcGIS software. We store our data in a normalized data model within these geodatabases.
We have found that the performance of this database is quite slow when dealing with relationships (i.e. relating several thousand records to several thousand records can take several minutes).
Is there any way to improve performance without completely flattening/denormalizing the database or is this strictly limited by the database platform we are using?
There is only one way: measure. Try to obtain a query plan and read it. Try to isolate a query from the log file, edit it into an executable (non-parameterised) form, and submit it manually (in psql). Try to tune it and see where it hurts.
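For example, assuming a PostgreSQL backend (as this answer does) and hypothetical table names, getting an actual plan for the isolated query looks like this:

```sql
-- EXPLAIN ANALYZE executes the query and reports per-node row counts and timings,
-- so you can see which scan or join is the expensive one.
EXPLAIN ANALYZE
SELECT i.objectid, a.asset_name
FROM inspections AS i
JOIN assets      AS a ON a.asset_id = i.asset_id
WHERE i.status = 'OPEN';
```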
Geometry joins can be costly in terms of CPU if many (big) polygons have to be joined and their bounding boxes have a high chance of overlapping. In the extreme case, you'll have to do a preselection on other criteria (e.g. zip code, if available) or maintain cache tables of matching records.
EDIT:
BTW: do you have statistics and autovacuum? IIRC, ESRI is still tied to postgres-8.3-something, where these were not run by default.
UPDATE 2014-12-11
ESRI does not interfere with non-GIS stuff. It is perfectly OK to add PK/FK relations or additional indexes to your schema; the DBMS will pick them up if appropriate, and ESRI will ignore them. (ESRI only uses its own meta-catalogs, ignoring the system catalogs.)
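For example (again assuming the PostgreSQL backend, with hypothetical table names), adding your own foreign key and index alongside the ESRI-managed schema:

```sql
-- The database planner will use these; ESRI's tooling simply ignores them.
ALTER TABLE inspections
    ADD CONSTRAINT fk_inspections_asset
    FOREIGN KEY (asset_id) REFERENCES assets (asset_id);

CREATE INDEX idx_inspections_asset_id ON inspections (asset_id);
```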
When I had to deal with spatial data, I tended to precalculate the values and store them. Yes, that makes for a big table, but it is much faster to query when you only do the complex calculation once, on data entry. Data entry does take longer, though. I was in a situation where all my spatial data came from a monthly load, so precalculating wasn't too bad.
