Microsoft SQL Server Analysis Services OLAP cube - sql-server

I was trying to find a tool to increase performance in the reports of our application and I heard about OLAP + Reporting Services which is described as an excellent combination to do this work. Anyway I didn't find the way to keep the OLAP cube up-to-date since the data in the original DB can change. (It's a transactional application and one pending record can be mark as paid etc).
Is this the better way to do this, or should I use another technology?
If the suggestion is still to use OLAP + Reporting services how can I have the information up-to-date?

I have never used them but I've heard that astrology + fortune telling are extremely cheaper, faster, more efficient, make magics and require to provide even less input than you did in this question.
"Anyway I didn't find the way to keep the OLAP cube up-to-date since the data in the original DB can change."
It is called ROLAP storage mode

It is usual for a OLAP database to be populated on a regular schedule from your OLTP database using some form of ETL (Extract, Transform, Load).
In the SQL Server world, that is often accomplished using SSIS.
I suggest you read these books:
The Data Warehouse Toolkit
The Data Warehouse ETL Toolkit

Related

Loading data from SQL Server to Elasticsearch

Looking for suggesting on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for Reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs and as a result, there is a 24 hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data, during the weekend, and writes, in real time, during weekdays.
Thanks.
ElasticSearch's main use cases are for providing search type capabilities on top of unstructured large text based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool to parse out pieces of those emails based on rules you setup with it to enable searching (and to some degree querying) capability of those email messages.
If your data is already in SQL Server, it sounds like it's structured already and therefore there's not much gained from ElasticSearch in terms of reportability and availability. Rather you'd likely be introducing extra complexity to your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look to building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features to accomplish this that you could look into are AlwaysOn Availability Groups, Replication, or SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) have different pros and drawbacks. For example, AlwaysOn Availability Groups are very easy to setup and offer the ability to automatically failover if your main server had an outage, but they clone the entire database to a replica. Replication let's you more granularly choose to only copy specific Tables and Views, but then you can't as easily failover if your main server has an outage. So you should read up on all three options and understand their differences.
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
I've been down both pathways of implementing ElasticSearch and a data warehouse using all three of the main data synchronization features above, for structured data and unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows in it (each Table terabytes big), and it was highly performant for reporting off of on fairly modest hardware in AWS (we weren't even using Redshift).

Can we use snowflake as database for Data driven web application?

I am Asp.Net MVC/SQLSERVER developer and I am very new to all these and so I may be on compelete wrong path.
I came to know by googling that Snowwflake can put/get data from AWS-S3, Google Storage and Azure. And Snowflake has their database and tables as well.
I have following questions,
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
Can we use Snowflake as database for data driven web application? and if yes, could you provide link or something to start?
Once again I am very new to all these and expecting from you to get ideas and best way to work arround this.
Thak you in advance.
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform, similar to other database technologies it provides data storage and metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example Snowflake doesn't support features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said, to add to that, from my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil those companies like databricks and snowflake try to sell you MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms because the longer a query runs the more money they make. For them to make money they have to optimize performance but not too much otherwise it will effect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing Referential integrity is a double edged sword, the downside being as the data volume grows the referential violation check significantly slows down the inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off the RI enforcement by the database, finally ending up with a Snowflake like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.

Which gives better performance for connection with Qlikview??SQL server or SSAS Cube?

I want to process my data into Qlikview but i am confused about to process the data through Cube or directly from SQL.
Can anyone tell me which gives better performance from cube and SQL?
Note: I have millions of data into the database.
Generally as the volume of data grows, the advantages of SSAS tend to become more apparent than those from using SQL Server as the source. How will the data be used? When it comes to large scale aggregations SSAS becomes very beneficial. SSAS will also force a structured layout, as the relationships are predefined in the cube as opposed to joins. Some additional features that SSAS brings are hierarchical analysis (hierarchies) as well as ease of use with tools such as Excel and SSRS, although it sounds like you're only looking to use Qlikview for this. However, your best option would be to do a baseline for both SSAS and SQL Server in your environment with queries that best represent what would be run when this is implemented, and assess the results from there.
From BI tool perspective it doesn't matter as you can connect to both source (SQL is more common but it depends on your expertise). Regarding performance the best strategy is to have separate extract layer and store data incrementally as qvd (for example every night previous day) so performance is not as important with incremental reload as even for big data sets it should be quick.
If your original source of data is SQL in my opinion it doesn't make sense to replicate data in 3 places (SQL, cube and QlikView) better connect directly to source save it incrementally raw data as qvd and then have transformer which will model that data.

Extract & transform data from Sql Server to MongoDB periodically

I have a Sql Server database which is used to store data coming from a lot of different sources (writers).
I need to provide users with some aggregated data, however in Sql Server this data is stored in several different tables and querying it is too slow ( 5 tables join with several million rows in each table, one-to-many ).
I'm currently thinking that the best way is to extract data, transform it and store it in a separate database (let's say MongoDB, since it will be used only for read).
I don't need the data to be live, just not older that 24 hours compared to the 'master' database.
But what's the best way to achieve this? Can you recommend any tools for it (preferably free) or is it better to write your own piece of software and schedule it to run periodically?
I recommend respecting the NIH principle here, reading and transforming data is a well understood exercise. There are several free ETL tools available, with different approaches and focus. Pentaho (ex Kettle) and Talend are UI based examples. There are other ETL frameworks like Rhino ETL that merely hand you a set of tools to write your transformations in code. Which one you prefer depends on your knowledge and, unsurprisingly, preference. If you are not a developer, I suggest using one of the UI based tools. I have used Pentaho ETL in a number of smaller data warehousing scenarios, it can be scheduled by using operating system tools (cron on linux, task scheduler on windows). More complex scenarios can make use of the Pentaho PDI repository server, which allows central storage and scheduling of your jobs and transformations. It has connectors for several database types, including MS SQL Server. I haven't used Talend myself, but I've heard good things about it and it should be on your list too.
The main advantage of sticking with a standard tool is that once your demands grow, you'll already have the tools to deal with them. You may be able to solve your current problem with a small script that executes a complex select and inserts the results into your target database. But experience shows those demands seldom stay the same for long, and once you have to incorporate additional databases or maybe even some information in text files, your scripts become less and less maintainable, until you finally give in and redo your work in a standard toolset designed for the job.

When to build a separate reporting database?

We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.
Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we-use in-house which we'll use to do the "heavier" reporting as the requirements solidify.
My question is: how do you know when a separate database for reporting is required?
What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?
In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.
When transaction performance is critical.
When it's hard to get a maintenance window on the transactional app.
If reporting needs to correlate results not only from this app, but from other application silos.
If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
If the reports are long running.
If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).
It adds non-trivial complexity, so imo, there has to be a good reason to split.
Typically, I would try to report off the transactional database initially.
Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
When you do go to a reporting database, remember there are only a few reasons you are going there:
Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.
Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
Now both the above could be done on the same server (even the same instance in a separate database or even just in a separate schema) and still see benefits. When CPU and IO are completely pegged, at that point, you definitely need to have it on a completely separate box (or upgrade your single box).
Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database is the same data in a different model. Reporting of large amounts of data (particularly aggregates) is extremely fast in dimensional models because the star schemas are very efficient for that. It also is efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because the dimensional model lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request). You could view this is a kind of mini-data warehouse where you use data warehousing techniques, but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particular easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI tools or reporting tools from star schemas. You could do this on the same box or different box etc, just like discussed earlier.
This question requires experience rather than science.
As a BI architect, the approach I take on designing each BI solution for my clients are very different. I don't go through a checklist. It requires a general understanding of their system, their reporting requirements, budget and man power.
I personally prefer to keep the reporting processes as much as possible on the database side (Best practice in BI world). REPORTING TOOLS ARE FOR DISPLAYING PURPOSE ONLY (MAXIMUM FOR SMALL CALCULATIONS). This approach requires a lot of pre-processing of data which requires different staging tables, triggers and etc.
When you said:
I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time with out issue.
There are a few things wrong with your statement.
Hundreds of millions of rows are A LOT. even today's in memory tools like Cognos TM1 or Qlikview would struggle to get such a results. (look at SAP HANA from SAP to understand how giants in the industry handle it).
If you have Hundreds of millions of rows in database, it doesn't necessarily mean that the report needs to go through all those records. maybe the report worked on 1000s not millions. probably that's what you saw.
Transactional reports are very different than dashboards. Most dashboard tools pre-processing and cache the data.
My point is that it all comes to experience for deciding when to:
design a new schema
create a semantic database
work on the same transactional database
or even use a reporting tool (Sometimes handwritten dashboards with Java/JSF/Ajax/jQuery or JSP would work fine for client)
The main reason you would need a separate database for reporting issues is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/Disk/etc... during a time of high activity you might think of using a separate database for reporting.
As for questions, here are some basic one:
Can I do the high intensity reports during non-peak hours?
Does it interfere with the users using the system?
If yes to #2, what are the costs of the interference Vs the cost of another database server, refactoring code, etc...?
I would also add another reason for which you might use a reporting database, and that is: CQRS pattern (Command Query Responsibility Separation).
If you have a large number of users accessing and writing to a small set of data, you would do wise to consider this pattern. It basicly, in its simplest form, means that all your commands (Create, Update, Delete) are pushed to the transactional database.
All of your queries (Read) are from your reporting database. This lets you freely scopy your architecture and upgrade function.
There are MUCH more to it in the pattern, I just mentioned the bit which was interesting due to your question regarding reporting database.
Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:
Reporting consuming inordinate amount of database server resources impacting the app's DB performance.
A part of this category would be the app DB work having to wait on a majorly slow report query due to locking, though it might be possible to resolve with less drastic methods like locking tuning.
Reporting queries being very incompatible with app queries as far as tuning (e.g. indices but not limited to that) - the most dumb example would be something like a hot spot affecting app inserts because of the reporting-purpose index.
Timing issues. E.g. the only small windows for DB maintenance available (due to application usage) are the times of heavy reporting work
Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.
I would also add that transactional databases are meant to hold current state and oftentimes do so to be self-maintaining. You don't want transactional databases growing beyond their necessary means. When a workflow or transaction is complete then move that data out and into a Reporting database, which is much better designed to hold historical data.

Resources