Best options for managing large sets of data in SQL Server - sql-server

I am currently working on a project which involves the following:
The application I am working on is connected to a SQL Server database.
SAP loads information into multiple tables (on a daily and also an hourly basis) in a MASTER database.
There are 5 other databases (hosted on the same server) that access this information via synonyms and stored procedure calls to the MASTER database.
The MASTER database is used purely for storing the data and routing it to the other databases.
Master Database -
Tables:
MASTER_TABLE1 <-- SAP inserts data into this table. Triggers process the valid data and insert it into secondary staging tables, e.g. MASTER_TABLE1_SEC.
MASTER_TABLE1_SEC -- Holds the processed data coming from MASTER_TABLE1.
Five other databases (one for each manufacturing facility) are present on the same server. My application is connected to the facility databases (not the MASTER database).
FACILITY1
FACILITY2
....
FACILITY5
Synonyms for MASTER_TABLE1_SEC are created in each of these 5 facility databases.
Stored procedures are then called from the facility databases to load data from MASTER_TABLE1_SEC into the respective tables within each facility, based on the business logic.
Is there a better architecture for this kind of project? I am a beginner when it comes to advanced data management. Can anyone suggest a better architecture or tools to handle this?

There are a lot of patterns that would meet the needs described here. It seems that you are working with a type of data warehouse. I use Data Vault for my enterprise data warehouses. It is an ensemble modeling technique designed for integration and master data preparation. You can think of it as a way to house all data from all time. You would then generate data marts (Kimball method) for each of the facilities, containing only their own data or whatever else is required for their needs.
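As a rough illustration of only the last step (deriving one facility-specific subset from a central store), here is a minimal Python sketch using the standard-library `sqlite3` module as a stand-in for SQL Server. It is not a Data Vault or Kimball model; the `facility`, `material` and `quantity` columns and the `mart_` table prefix are hypothetical names chosen for the example.

```python
import sqlite3

def build_facility_marts(conn, facilities):
    """Create one filtered table per facility from the central staging table."""
    counts = {}
    for facility in facilities:
        mart = f"mart_{facility.lower()}"  # hypothetical naming scheme
        conn.execute(f"DROP TABLE IF EXISTS {mart}")
        conn.execute(f"CREATE TABLE {mart} (material TEXT, quantity INTEGER)")
        # Each mart receives only the rows that belong to its facility.
        conn.execute(
            f"INSERT INTO {mart} "
            "SELECT material, quantity FROM master_table1_sec WHERE facility = ?",
            (facility,),
        )
        counts[facility] = conn.execute(f"SELECT COUNT(*) FROM {mart}").fetchone()[0]
    conn.commit()
    return counts

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE master_table1_sec (facility TEXT, material TEXT, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO master_table1_sec VALUES (?, ?, ?)", [
        ("FACILITY1", "bolt", 10),
        ("FACILITY1", "nut", 5),
        ("FACILITY2", "gear", 3),
    ])
    return build_facility_marts(conn, ["FACILITY1", "FACILITY2"])
```

The point of the sketch is that the central table stays the single source of truth, and each facility's table can be rebuilt from it at any time.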

Related

Create a Data Warehouse with the database on SQL Developer

I have a database in SQL Developer which pulls data from an ERP tool, and I would like to create a data warehouse in order to connect it to PowerBI.
This is the first time I am doing this whole process from the beginning, so I am not very experienced.
Where would you suggest creating the data warehouse (I was thinking of SSMS), and how can I connect it to PowerBI?
My data warehouse will consist of some views of my tables and some joins to get the data into the structure that I want, since it is not possible to change anything in the DB.
Thanks in advance.
A "data warehouse" is just a database. The distinction is really more about the commonly used schema design, in the sense that a warehouse is often built along the lines of a star or snowflake design.
So if you already have a database that is extracting data from your ERP, there is nothing to stop you from pointing PowerBI directly at it and performing some analytics. If your intention is to start with this database and then clone/extract/load the data into a new database with a star/snowflake schema, then that's a much bigger exercise.

Extracting Data from SAP to SQL Server

I am using SSIS packages to extract data from SAP database tables into SQL Server tables. I am using OLEDB source/destination connections to achieve this.
The problem now is that a table in SAP has 5 million records and it's taking around 2 hours to extract this data into my SQL Server table. I have used the trunc-dump method (truncating the table in SQL Server and dumping data into it from the SAP table) and also tried using a multiple hash key to bring in the updated/new records.
The problem with the hash key is that it still has to scan the entire table to look for changed/new records, and hence takes almost the same time as the trunc-dump method.
I am looking for a new way or changing the existing way to reduce the time taken to complete this extraction.
You mentioned that you are using an OLEDB source connection to access SAP. If that means you are accessing SAP's underlying database directly, you should pause until you have explicit IT approval, for three reasons:
1. You are bypassing SAP's application layer security, which can be an enterprise security compliance issue.
2. Your company's SAP license may not allow it. If your company only has an SAP indirect access license, you may have to stay on the application layer.
3. You will not get SAP's official support when accessing the underlying database directly.
You have multiple options to fetch data using SSIS through the SAP application layer:
1. Use commercial SSIS custom components for this job (disclaimer: AecorSoft is one of the leading vendors offering such connectivity components).
2. Look into SAP's own OData Gateway interface to consume data.
3. Ask your SAP ABAP team to write custom ABAP programs to dump SAP data into CSV files, and then use SSIS to fetch them.
Let's now look at the performance side:
SAP ETL performance depends on many factors, but in general, even for SAP transactional tables with 100+ columns, extracting 5 million rows in a couple of hours is considered very slow. For example, we've seen the standard SAP General Ledger header table BKPF (almost 100 columns) extracted at a consistent rate of 1M rows every 1-2 minutes. Of course, such performance is achieved through a commercial component and SSIS, but you should expect at least 1M rows per 10 minutes even for option #3 above, going through an intermediate CSV file. Under the hood, all 3 options go through the SAP application layer and use SAP Open SQL (in contrast to the "Native SQL" which the underlying database offers) to access SAP tables. Therefore, if you experience performance issues at the application layer, you can analyze the Open SQL side.
You also mentioned the updated/new records scenario, which is a typical delta extraction problem. Normally, SAP transactional tables have Create Date and Changed Date fields which can help you capture the delta. In this case, in order to avoid a full table scan, create indexes on those "delta fields" through the SAP application layer. For example, if you need to extract the Sales Document Header table VBAK, you can filter by ERDAT (Created on) and AEDAT (Changed on). Delta is a complex subject in SAP. There is no simple statement that describes the delta solution, as SAP data models are complex and very different across functional modules. Delta analysis is always a case-by-case effort. Some people may simply recommend using "delta extractors", but don't treat them as a silver bullet, because extractors have their own problems. In short, if you are looking into table-based extraction, focus on that, and work with your SAP functional team to determine suitable delta fields. Try to avoid full table scans and hashing. Do an incremental load with some optional overlap of the previous extract (e.g. loading today's and yesterday's records), and do a MERGE to absorb the changes.
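The incremental load with overlap described above can be sketched in Python, again using the standard-library `sqlite3` module as a stand-in for both SAP and SQL Server. This is a minimal sketch under several assumptions: the cut-down `vbak` table with an extra `netwr` column is hypothetical, dates are stored as ISO strings (real SAP date fields are formatted differently), and SQLite's `INSERT OR REPLACE` stands in for a T-SQL `MERGE` keyed on the document number.

```python
import sqlite3
from datetime import date, timedelta

DDL = "CREATE TABLE vbak (vbeln TEXT PRIMARY KEY, erdat TEXT, aedat TEXT, netwr REAL)"

def load_delta(source, target, run_date, overlap_days=1):
    """Copy rows created or changed since (run_date - overlap_days) into target."""
    since = (run_date - timedelta(days=overlap_days)).isoformat()
    # Filter on the delta fields instead of scanning and hashing the whole table.
    rows = source.execute(
        "SELECT vbeln, erdat, aedat, netwr FROM vbak "
        "WHERE erdat >= ? OR aedat >= ?",
        (since, since),
    ).fetchall()
    # INSERT OR REPLACE plays the role of MERGE: new keys are inserted,
    # existing keys are overwritten with the latest values.
    target.executemany(
        "INSERT OR REPLACE INTO vbak (vbeln, erdat, aedat, netwr) VALUES (?, ?, ?, ?)",
        rows,
    )
    target.commit()
    return len(rows)

def demo():
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute(DDL)
    target.execute(DDL)
    source.executemany("INSERT INTO vbak VALUES (?, ?, ?, ?)", [
        ("0001", "2024-01-01", None, 100.0),          # old, unchanged
        ("0002", "2024-01-01", "2024-03-10", 250.0),  # old, changed today
        ("0003", "2024-03-09", None, 75.0),           # created yesterday
    ])
    target.executemany("INSERT INTO vbak VALUES (?, ?, ?, ?)", [
        ("0001", "2024-01-01", None, 100.0),
        ("0002", "2024-01-01", None, 200.0),
    ])
    moved = load_delta(source, target, date(2024, 3, 10))
    final = target.execute("SELECT vbeln, netwr FROM vbak ORDER BY vbeln").fetchall()
    return moved, final
```

Only the two rows touched inside the overlap window are transferred. Re-running the load for the same day is harmless, because the upsert is idempotent.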
There are a few cases where you may not be able to find any delta field, and it is not practical to do a full load every time. One good example is the Address Master data table ADRC. If you are required to do a delta load on such a table, you either have to ask your SAP functional team to figure out the delta for you (meaning they inject custom logic into every place where an Address master record can be created, updated, or deleted), or you have to ask your SAP Basis team to create a DB trigger on the underlying database table and expose the trigger table at the application layer. This way, you can create an application layer view on the main table and the trigger table to do the delta. There is still no direct database access in your solution. The DB layer trigger is fully managed and controlled by your SAP Basis team, who also support the database.
Hope this helps!

AWS Glue: SQL Server multiple partitioned databases ETL into Redshift

Our team is trying to create an ETL into Redshift to be our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned out our database into 40+ datasources. We are looking for a way to be able to pipe the data from all of these identical data sources into 1 Redshift DB.
Looking at AWS Glue, it doesn't seem possible to achieve this. Since they open up the job script to be edited by developers, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid having to create a job for each database... unless we can programmatically loop through and create multiple jobs for each database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would work for the multiple partitioned datasource issue either.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star-schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
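The idea of parameterising the connection and iterating over the sources can be sketched generically in Python. This is not Matillion or AWS Glue code; it is a minimal illustration that uses in-memory `sqlite3` databases as stand-ins for the partitioned SQL Server databases and for Redshift. The `orders` table and the `source_db` column are hypothetical.

```python
import sqlite3

def consolidate(sources, target, table="orders"):
    """Copy `table` from every source connection into one target table,
    tagging each row with the name of the database it came from."""
    target.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (source_db TEXT, id INTEGER, amount REAL)"
    )
    for name, conn in sources.items():  # same job body, different connection
        rows = conn.execute(f"SELECT id, amount FROM {table}").fetchall()
        target.executemany(
            f"INSERT INTO {table} (source_db, id, amount) VALUES (?, ?, ?)",
            [(name, *row) for row in rows],
        )
    target.commit()

def demo():
    sources = {}
    for i in range(1, 4):  # three stand-ins for the 40+ partitioned databases
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
        conn.execute("INSERT INTO orders VALUES (?, ?)", (1, 10.0 * i))
        sources[f"db{i}"] = conn
    target = sqlite3.connect(":memory:")
    consolidate(sources, target)
    return target.execute(
        "SELECT source_db, id, amount FROM orders ORDER BY source_db"
    ).fetchall()
```

Keeping a source identifier on every row matters when the databases are horizontal partitions, since the same primary key can appear in more than one of them.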
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
1. Set up and configure a DMS instance.
2. Set up a target endpoint for Redshift.
3. Set up a source endpoint for each SQL Server instance; see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
4. Set up a task for each SQL Server source. You can specify the tables to copy/synchronise, and you can use a transformation to specify which schema name(s) on Redshift you want to write to.
You will then have all of the data in identical schemas on Redshift.
If you want to query all of those together, you can either run some transformation code inside Redshift to combine them into new tables, or you may be able to use views.
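For the view option, the statement is repetitive enough to generate. The helper below is a small sketch that only builds the SQL text for a `UNION ALL` view over identical tables in several schemas; the schema, table and view names are hypothetical, and the generated statement should be reviewed before running it on a real cluster.

```python
def union_view_sql(view_name, schemas, table):
    """Build a CREATE VIEW statement that stacks the same table from each schema."""
    selects = [
        f"SELECT '{schema}' AS source_schema, * FROM {schema}.{table}"
        for schema in schemas
    ]
    return f"CREATE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects) + ";"

def demo():
    return union_view_sql("all_orders", ["site01", "site02"], "orders")
```

The names are interpolated directly into the SQL, so this should only ever be fed trusted, hard-coded identifiers.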

Expressing data transformation

Suppose you have two different relational databases.
Your task is to write code to transfer the data from the first database to the second.
Some tables in the database you are transferring to have the same structure as the table you are transferring from; transferring these is as simple as "INSERT INTO DbA.TableA (...) SELECT * FROM DbB.TableB".
Other tables in the database you are transferring to have different structures and different purposes. After proper analysis, you understand the relations and the right transformation you need to code.
My question is: how do you express such knowledge? How do you express the transformational relations between two databases? Are there any tools or diagrams?
The best way I know right now is to write out the list of tables of the first database and, for each table, describe how it is to be transformed into the second database. Is it possible to make this more formal/concise/cool?
If you want a toolset and work in the Microsoft database stack, then this is exactly what SQL Server Integration Services (SSIS) is used for.
If you want to document the process, then you would typically write an interface definition document (IDD). There are many examples on Google but here is something to get you started.
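One lightweight way to make such a mapping more formal, offered here as an illustration rather than anything the answers above prescribe, is to keep the mapping as data and generate the `INSERT ... SELECT` statement from it. The `TableMapping` class and the `person`/`customer` tables below are hypothetical, and `sqlite3` is used only so the sketch can be run.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class TableMapping:
    source: str
    target: str
    columns: dict  # target column name -> SQL expression over source columns

    def to_sql(self):
        cols = ", ".join(self.columns)
        exprs = ", ".join(self.columns.values())
        return f"INSERT INTO {self.target} ({cols}) SELECT {exprs} FROM {self.source}"

def demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE person (first TEXT, last TEXT, born TEXT)")
    db.execute("CREATE TABLE customer (full_name TEXT, birth_year INTEGER)")
    db.execute("INSERT INTO person VALUES ('Ada', 'Lovelace', '1815-12-10')")
    mapping = TableMapping("person", "customer", {
        "full_name": "first || ' ' || last",
        "birth_year": "CAST(substr(born, 1, 4) AS INTEGER)",
    })
    db.execute(mapping.to_sql())  # the mapping doubles as documentation and code
    return db.execute("SELECT full_name, birth_year FROM customer").fetchall()
```

The same mapping objects could be rendered into a table for the interface document, so the written description and the executed transformation cannot drift apart.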

Difference between Database and Data Source

What is the difference between Database and Data Source?
A data source is simply something your program relies on to get data. A database is a kind of data source that persists data to some digitized form. Other data sources include files, services, etc — these all provide data to your programs.
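A short Python sketch may make the distinction concrete: the consuming function below only needs rows, and does not care whether they came from a CSV file or from a database. The `people` data and the function names are hypothetical, and `sqlite3` stands in for any database.

```python
import csv
import io
import sqlite3

def rows_from_csv(text):
    """Data source 1: CSV text, parsed into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_db(conn, query):
    """Data source 2: a database query, returned in the same shape."""
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur]

def total_age(rows):
    """The consumer works with any source that yields rows with an 'age' field."""
    return sum(int(r["age"]) for r in rows)

def demo():
    csv_rows = rows_from_csv("name,age\nAsha,30\nRavi,25\n")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", [("Asha", 30), ("Ravi", 25)])
    db_rows = rows_from_db(conn, "SELECT name, age FROM people")
    return total_age(csv_rows), total_age(db_rows)
```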
Further to BoltClock's answer, here are examples of databases (or database servers) vs. various data sources.
Databases
SQL
Oracle
MySQL
Data Sources
All of the databases above
XML Files
CSV Files
Web Services
and many many more..
A data source need not be a database; it can be a file system or any other source of data.
To quote the description from Techopedia:
A data source, in the context of computer science and computer
applications, is the location where data that is being used come from.
In a database management system, the primary data source is the
database, which can be located in a disk or a remote server. The data
source for a computer program can be a file, a data sheet, a
spreadsheet, an XML file or even hard-coded data within the program.
I will try to answer this question in simple words.
First, we need to understand data sources. A data source is something we get data from to analyze, or a place where data is stored. Different kinds of data sources are:
1. Databases
2. Flat files, Excel sheets, spreadsheets
3. Web services, etc.
Now for the database: as the list above shows, a database is one kind of data source. Companies store their collections of records, responses, surveys, etc. in databases. Database systems come in two types:
1. DBMS: Database Management System
2. RDBMS: Relational Database Management System
Data: images, videos, files, PDFs, messages, names, ages, heights, weights, etc.
A database is a collection of data (contact numbers, best friends' names, and a shopping list are all collections of data) stored systematically in a format that can be easily accessed.
Example of a database: an attendance register
Attendance register of employees in an office
Attendance register of students in a school
An attendance register is a database.
Databases are stored on computers, mobile phones, in tables, Excel files, folders, etc.
Types of database: network database, object-oriented database, relational database, hierarchical database, just as there are attendance registers, attendance pages and attendance diaries.
Database Management System (DBMS): a DBMS is a software package used to create, manipulate, retrieve and manage data in a database. A DBMS generally manipulates the data itself, the data format, field names, record structure and file structure. It also defines rules to validate and manipulate this data. A real-world analogy for a DBMS: HR or a teacher.
HR or a teacher maintains the register, just as a DBMS maintains the database.
A database can be created in two ways:
Manually (by hand): using pen and paper
Electronically (computer, mobile phones, etc.): using DBMS software, a file system, etc.
A database stores the data and provides a method to access it; a DBMS converts queries into meaningful commands that invoke that access method.
Some DBMS examples (each one compared to a teacher):
MySQL (e.g. the English teacher)
SQL Server (the Hindi teacher)
Oracle (the EVS teacher)
dBASE (the senior teacher)
FoxPro (the math teacher)
Every teacher maintains an attendance register, just as each of these DBMSs maintains a database.
The principal or head decides which teacher creates and maintains the register, just as the developer decides which DBMS (MySQL, Oracle) is best to create and maintain the database.
Example query: find Rohan's total attendance.
Find Rohan's total attendance --> HR/teacher --> attendance register
Find Rohan's total attendance --> DBMS (MySQL) --> database
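In code, the last line of that analogy might look like the sketch below. It uses Python's built-in `sqlite3` module as the DBMS; the `attendance` table layout and the sample rows are made up for illustration.

```python
import sqlite3

def total_attendance(conn, student):
    """Ask the DBMS how many days the given student was present."""
    row = conn.execute(
        "SELECT COUNT(*) FROM attendance WHERE student = ? AND present = 1",
        (student,),
    ).fetchone()
    return row[0]

def demo():
    conn = sqlite3.connect(":memory:")  # the "register" lives in memory here
    conn.execute("CREATE TABLE attendance (student TEXT, day TEXT, present INTEGER)")
    conn.executemany("INSERT INTO attendance VALUES (?, ?, ?)", [
        ("Rohan", "2024-03-01", 1),
        ("Rohan", "2024-03-02", 0),
        ("Rohan", "2024-03-03", 1),
        ("Meera", "2024-03-01", 1),
    ])
    return total_attendance(conn, "Rohan")
```

The program never reads the stored file itself; it hands a query to the DBMS, which finds and counts the matching rows.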
SQL (Structured Query Language) and NoSQL
SQL stands for Structured Query Language. SQL is used to communicate with a database; it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from it. Some common DBMSs that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, Ingres, MySQL and SQLite. A SQL database is a table-based, relational database: data is stored in rows and columns, much like in MS Excel.
In NoSQL, data is stored as key-value pairs or in documents such as JSON. Examples of NoSQL databases are Redis and MongoDB.
As an analogy, SQL is the class monitor or head student who helps the DBMS (teacher, HR) manage the database (register). One class monitor helps all the teachers, just as one SQL is used by many DBMSs (MySQL, Oracle).
Purpose: to query and operate a database system.
SQL DBMSs: 1. MySQL 2. MS SQL 3. Oracle, etc.
NoSQL DBMSs: 1. Redis 2. MongoDB, etc.
All of these are DBMSs.
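The difference between the table-based model and the key-value model can be illustrated with a small Python sketch. It uses `sqlite3` for the relational side; on the NoSQL side a plain Python dict holding JSON strings stands in for a key-value store such as Redis, which is an assumption made purely so the example runs without a server. The `student` data is hypothetical.

```python
import json
import sqlite3

def relational_lookup(name):
    """Relational style: fixed columns, queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (name TEXT PRIMARY KEY, age INTEGER)")
    conn.execute("INSERT INTO student VALUES ('Rohan', 12)")
    return conn.execute(
        "SELECT age FROM student WHERE name = ?", (name,)
    ).fetchone()[0]

def key_value_lookup(name):
    """Key-value style: one key maps to a free-form JSON document."""
    store = {"student:Rohan": json.dumps({"age": 12, "hobbies": ["cricket"]})}
    return json.loads(store[f"student:{name}"])["age"]

def demo():
    return relational_lookup("Rohan"), key_value_lookup("Rohan")
```

Both lookups return the same age, but the relational version enforces a schema up front, while the key-value version lets each document carry whatever fields it likes.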
