Is a single table a bad starting point for OLAP cubes (SQL Server Analysis Services)? - sql-server

I'm going to use a single table to aggregate historical data about our (very big) virtual infrastructure. The table will have 15 to 30 fields, and I estimate 500 to 1000 records a day.
Why a single table? A couple of reasons:
Data is extracted to CSV using PowerShell scripts, so a bulk load into a single table is very easy and fast.
I will connect Excel to the table and report through pivot tables, so a single table is ideal (otherwise I would have to create views).
Now my question:
If I'm planning to build cubes on top of this table in the future, is the single-table choice a bad solution?
Do cubes rely on multi-table relational schemas, or can they easily be built on a single-table database?
Thanks for any suggestions.

I can't tell you specifically about SQL Server Analysis Services, but for OLAP you typically use denormalized and aggregated data, which means fewer tables than in a normal relational scenario. And since your data volume is not really big (365k rows/year is small even for OLAP), I don't see any problem using a single table for your data.
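As a concrete illustration of the single-table approach described in the question, here is a minimal T-SQL sketch; the table name dbo.VmDailyStats, its columns, and the CSV path are all hypothetical and would be adjusted to the 15-30 fields actually collected:

CREATE TABLE dbo.VmDailyStats
(
    -- Hypothetical single, denormalized table for daily VM statistics.
    CollectionDate  date          NOT NULL,
    VmName          nvarchar(128) NOT NULL,
    HostName        nvarchar(128) NOT NULL,
    ClusterName     nvarchar(128) NULL,
    CpuCount        int           NULL,
    MemoryGb        decimal(10,2) NULL,
    ProvisionedGb   decimal(12,2) NULL,
    UsedGb          decimal(12,2) NULL
    -- ...remaining fields, up to the 15-30 you collect
);

-- Bulk load the CSV produced by the PowerShell extract.
-- The file path and format options are assumptions; adjust them to your export.
BULK INSERT dbo.VmDailyStats
FROM 'C:\exports\vm_stats_daily.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

Excel pivot tables can report straight off this one wide table, and a future SSAS cube can use it as its fact source, deriving dimensions from the descriptive columns.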

Related

SQL Performance when using different tables or different databases

Currently, I have a SQL Server database with a lot of customers.
I have one table that stores the data of all customers. Now I want to split the data so that each table stores the data of one customer, or each database stores the data of one customer.
I'm unsure about the performance implications. Which is the better solution: one table per customer in the same database, or one database per customer?
As explained in the comments, both approaches you describe are a bad idea. If you want to increase database performance, use indexes correctly. You may also look at partitioned tables:
https://learn.microsoft.com/ru-ru/sql/relational-databases/partitions/partitioned-tables-and-indexes?view=sql-server-ver15
https://www.cathrinewilhelmsen.net/2015/04/12/table-partitioning-in-sql-server/
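For illustration, a minimal partitioned-table sketch; the table, column names, and yearly boundaries are hypothetical, and you could just as well partition on a customer key instead of a date:

-- Hypothetical yearly partitioning of one shared Orders table.
CREATE PARTITION FUNCTION pfOrderYear (date)
    AS RANGE RIGHT FOR VALUES ('2021-01-01', '2022-01-01', '2023-01-01');

CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders
(
    OrderId     bigint        NOT NULL,
    CustomerId  int           NOT NULL,
    OrderDate   date          NOT NULL,
    Amount      decimal(18,2) NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderDate, OrderId)
) ON psOrderYear (OrderDate);

-- An index on CustomerId keeps per-customer queries fast in the shared table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);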

Extracting Data from SAP to SQL Server

I am using SSIS packages to extract data from SAP database tables into SQL Server tables. I am using OLEDB source/destination connections to achieve this.
The problem now is that a table in SAP has 5 million records and it's taking around 2 hours to extract this data into my SQL Server table. I have used the trunc-dump method (truncating the table in SQL Server and dumping data into it from the SAP table) and also tried using the Multiple Hash component to bring in updated/new records.
The problem with the hash approach is that it still has to scan the entire table to look for changed/new records, and hence takes almost the same time as the trunc-dump method.
I am looking for a new way or changing the existing way to reduce the time taken to complete this extraction.
Since you mentioned you are using an OLEDB source connection to access SAP, if that means you are accessing SAP's underlying database directly, you should pause doing that until there is explicit IT approval, for three reasons:
You bypass SAP's application-layer security, which can be an enterprise security compliance issue;
Your company's SAP license may not allow it. If your company only has an SAP indirect-access license, you may have to stay on the application layer;
You will not get SAP's official support if you access the underlying database directly.
You have multiple options to fetch data with SSIS through the SAP application layer:
Use commercial SSIS custom components for this job (disclaimer: AecorSoft is one of the leading vendors offering such connectivity components);
Look into SAP's own OData Gateway interface to consume data.
Request your SAP ABAP team to write custom ABAP programs to dump SAP data into CSV files, and then use SSIS to fetch them.
Let's now look at the performance side:
SAP ETL performance depends on many factors, but in general, even for SAP transactional tables with 100+ columns, taking a couple of hours to extract 5 million rows is considered very slow. For example, we have seen the standard SAP General Ledger header table BKPF (almost 100 columns) extracted at a consistent rate of 1M rows every 1-2 minutes. That performance was achieved with a commercial component and SSIS, but you should expect at least 1M rows per 10 minutes even with option #3 above, which goes through an intermediate CSV file. Under the hood, through the SAP application layer, all three options use SAP Open SQL (in contrast to the "Native SQL" the underlying database offers) to access SAP tables; therefore, if you experience an application-layer performance issue, you can analyze the Open SQL side.
As for the updated/new records scenario you mention, this is a typical delta-extraction problem. Normally, SAP transactional tables have Create Date and Changed Date fields which help you capture the delta. To avoid a full table scan, apply indexes through the SAP application layer on those "delta fields". For example, if you need to extract the Sales Document Header table VBAK, you can filter by ERDAT (Created on) and AEDAT (Changed on). Delta is a complex subject in SAP; there is no single statement that describes the delta solution, because SAP data models are complex and differ across functional modules, so delta analysis is always a case-by-case effort. Some people may simply recommend "delta extractors", but don't treat them as a silver bullet, because extractors have their own problems. In short, if you go with table-based extraction, focus on that and work with your SAP functional team to determine suitable delta fields. Avoid full table scans and hashing. Do an incremental load with some overlap of the previous extract (e.g. loading today's and yesterday's records), and use MERGE to absorb the changes.
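On the SQL Server side, that incremental-load-with-overlap-plus-MERGE pattern could look like the sketch below; the target and staging table names are assumptions, and only a few representative VBAK fields are shown rather than a full column mapping:

-- Assume the extract lands the last two days of VBAK rows (overlap included)
-- in a staging table shaped like the target; MERGE then absorbs the changes.
MERGE dbo.SalesDocumentHeader AS tgt
USING dbo.SalesDocumentHeader_Staging AS src
    ON tgt.VBELN = src.VBELN                 -- sales document number (key)
WHEN MATCHED THEN                            -- re-extracted or changed row: overwrite
    UPDATE SET tgt.ERDAT = src.ERDAT,
               tgt.AEDAT = src.AEDAT,
               tgt.NETWR = src.NETWR         -- ...plus the remaining columns
WHEN NOT MATCHED BY TARGET THEN              -- new document: insert
    INSERT (VBELN, ERDAT, AEDAT, NETWR)
    VALUES (src.VBELN, src.ERDAT, src.AEDAT, src.NETWR);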
There are a few cases where you may not find any delta field, and a full load every time is not practical. One good example is the address master data table ADRC. In that case, if you are required to do a delta load on such a table, you either have to ask your SAP functional team to figure out the delta for you (meaning they inject custom logic into every place where address master data can be created, updated, or deleted), or you have to ask your SAP Basis team to create a DB trigger on the underlying database table and expose the trigger table at the application layer. You can then create an application-layer view over the main table and the trigger table to do the delta. There is still no direct database access in this solution; the DB-layer trigger is fully managed and controlled by your SAP Basis team, who also support the database.
Hope this helps!

Rapidminer: Advantages of a database?

My question is: is there any advantage to reading data into RapidMiner from a database (like SQL Server, MongoDB, ...) instead of, say, from an Excel table?
Is it faster, or easier to read the data?
Or is the only advantage that you have a database where you can store your data, with no advantage (e.g. in speed) for RapidMiner itself?
I wanted to use either SQL Server Express (2014 or 2016) or some open-source database... the dataset is currently about 5000 records from one table, but there will be at most 3 tables in the end, and the records are growing steadily over time...
Can someone recommend a database for this (maybe better an open-source one)?
I want to analyze all 3 tables together in RapidMiner....

Column store in data warehouse

I have a question about data warehousing and column-oriented databases. In my project the company uses a warehouse solution built on SQL Server with Visual Studio, and they have performance trouble when running complex queries over large amounts of data. I want to try replacing the database with a column-oriented database. I know that you can "transform" a row-oriented database into a more column-based one, or use a columnar database such as Vertica or Sybase IQ; I'm just wondering how that would fit into the warehouse. Do you have to have a star-join schema in a warehouse, or can you use the columnar approach instead? I realize this is kind of a stupid question, but I'm just trying to understand it all before I start to explore the different databases and solutions.
I know that SQL Server 2012 has a column store, but I would like to try the other open-source databases as well.
Thanks in advance!
Do you have to have a star join schema in a warehouse or can you use the columnar approach instead?
The star join schema consists of the table definitions of your data warehouse. The star schema, and similar schemas, trade query performance for query flexibility. Usually, query flexibility is more important than query performance in a data warehouse.
Based on the Wikipedia article you linked to in your comments, a column oriented database engine stores the actual database bytes in column order, rather than the traditional row order of relational databases.
As the article says, this can improve disk access performance.
The star schema is how you define tables. A column oriented database engine is concerned with how the database information is written to disk. The two concepts have nothing to do with one another, except that they both apply to a data warehouse.
Keep your present data warehouse schema, and see if a column oriented database engine will improve query performance.
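In SQL Server terms, that experiment can be as small as adding a columnstore index to the existing fact table without touching the star schema. A minimal sketch, assuming SQL Server 2012 or later and a hypothetical dbo.FactSales fact table:

-- Keep the existing star schema; only change how the fact data is stored and scanned.
-- Note: in SQL Server 2012/2014 a nonclustered columnstore index makes the table
-- read-only while it exists, so it is typically dropped and rebuilt around batch loads.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (DateKey, ProductKey, CustomerKey, Quantity, SalesAmount);

SQL Server 2014+ also offers CREATE CLUSTERED COLUMNSTORE INDEX if you want the whole fact table stored in columnar format.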

One Database with 20 million records or 51 databases with 50,000-300,000 records in each database?

I've bought a CSV United States business database with ~20 million records, which is divided into 51 databases; each database represents a state.
I need to write an ASP.NET MVC web application that will query this data by state and other arguments. Should I create a SQL Server database and import all the records from all 51 CSV files? Or should I query the CSV files directly? Which will be fastest? Feel free to suggest other solutions.
Thanks.
Create a single database and put all those records in it. But do it in a structured fashion, of course.
For instance, you could create a table 'State' and a table called 'Business', and create a relationship between those 2 tables.
Normalize your database further.
A performant database starts with a good, normalized DB schema.
Add the necessary indexes, and you should be fine.
A database is designed to handle a large number of records.
One table, with appropriate indexes. 20 million records is peanuts.
I would import the data into one big database. As long as the table is correctly indexed, it will offer better query performance: instead of having to scan each file, SQL Server can use the indexes to speed things up.
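Putting those answers together, a minimal sketch of a normalized two-table schema with the index the per-state queries would use; all table, column, and file names here are hypothetical:

-- One normalized schema instead of 51 databases.
CREATE TABLE dbo.State
(
    StateCode char(2)      NOT NULL PRIMARY KEY,   -- e.g. 'CA'
    StateName nvarchar(50) NOT NULL
);

CREATE TABLE dbo.Business
(
    BusinessId   bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    StateCode    char(2)       NOT NULL REFERENCES dbo.State (StateCode),
    BusinessName nvarchar(200) NOT NULL,
    City         nvarchar(100) NULL,
    ZipCode      varchar(10)   NULL
    -- ...other columns from the purchased CSV files
);

-- The index that makes "query by state and more arguments" cheap.
CREATE NONCLUSTERED INDEX IX_Business_State
    ON dbo.Business (StateCode)
    INCLUDE (BusinessName, City, ZipCode);

-- Each of the 51 CSV files can then be bulk loaded (the path is an assumption), e.g.:
-- BULK INSERT dbo.Business_Staging FROM 'C:\data\business_CA.csv'
--     WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');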
