handling performance issues in SSIS - sql-server

I have billions of records to transfer via SSIS and I am looking at improving the migration speed. I am trying to save the result set into a table, and I am taking into account that my front end would be slow if all that data sat in one table, so I am weighing various options. I am thinking of having multiple physical tables; I only need the last 5 years of data. Will it be faster if I execute 5 different versions of the same stored procedure with different year filters and populate the tables? I know one can achieve parallelism in SSIS. My only fear is that since all five stored procedures would be running in parallel, they might lock the tables.

You could look into table partitioning schemes in SQL Server. It seems like the Year column would be a good field to use in your partitioning function.
Table Partitioning in SQL Server
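A minimal sketch of what a year-based partitioning setup could look like; the table name, column name, and boundary dates below are illustrative assumptions, not taken from the question:

    -- Assumes a DATETIME column named OrderDate; adjust boundaries to your last 5 years.
    CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01');

    CREATE PARTITION SCHEME psOrderYear
        AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

    -- Creating the table on the scheme puts each year's rows in its own partition,
    -- so five year-filtered loads touch different partitions instead of one big table.
    CREATE TABLE dbo.Orders
    (
        OrderID   BIGINT   NOT NULL,
        OrderDate DATETIME NOT NULL,
        -- ... other columns ...
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderDate, OrderID)
    ) ON psOrderYear (OrderDate);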

Related

SQL Server Complex Add/Update merge million rows daily

I have 7 reports which are downloaded daily at late night.
These reports can be downloaded in CSV/XML; I am downloading them in CSV format as it is more memory efficient.
This process runs in the background and is managed by Hangfire.
After they are downloaded, I use Dapper to run a stored procedure which inserts/updates/deletes data using MERGE statements. This stored procedure has seven table-valued parameters.
Instead of deleting, I update that record's IsActive column to false.
Note that 2 reports have more than 1 million records.
I am getting timeout exceptions only in Azure SQL; on a regular SQL Server it works fine. As a workaround, I have increased the timeout to 1000 for this query.
The app is running on Azure S2.
I have pondered the option of sending XML, but I have found that SQL Server is slow at processing XML, which is counterproductive.
I also cannot use SqlBulkCopy, as I have to update based on some conditions.
Also note that more reports will be added in the future.
Also, when a new report is added there is a large number of inserts; if a previously added report is run again, it is mostly updates.
These tables currently do not have any indexes, only a clustered integer primary key.
Each row has a unique code. This code is used to identify whether to insert/update/delete.
Can you recommend a way to increase performance?
Is your source sending the whole data set, whether rows are updated or new? I assume that by mentioning the unique code (insert/update/delete) you are only sending changes (a delta); if not, that is one area to address. Another is to consider parallelism. I think you would then need a different stored procedure for each table; non-dependent tables could be processed together.
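If the delta arrives as a table-valued parameter keyed on the unique code, an index on that code plus a MERGE joined on it is worth checking first. A minimal sketch; the type, table, and column names are illustrative assumptions, not from the question:

    -- Hypothetical TVP type mirroring one report's rows.
    CREATE TYPE dbo.ReportRowType AS TABLE
    (
        Code     VARCHAR(50)    NOT NULL PRIMARY KEY,  -- the unique code used for matching
        Amount   DECIMAL(18, 2) NULL,
        IsActive BIT            NOT NULL
    );
    GO

    -- A unique index on the code turns the MERGE join into seeks instead of scans.
    CREATE UNIQUE NONCLUSTERED INDEX IX_ReportData_Code ON dbo.ReportData (Code);
    GO

    CREATE PROCEDURE dbo.UpsertReportData
        @rows dbo.ReportRowType READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        MERGE dbo.ReportData AS target
        USING @rows AS source
            ON target.Code = source.Code
        WHEN MATCHED THEN
            UPDATE SET target.Amount   = source.Amount,
                       target.IsActive = source.IsActive
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (Code, Amount, IsActive)
            VALUES (source.Code, source.Amount, source.IsActive);
    END;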

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators work on one order one after another, sometimes one operator works on multiple small orders, and there is no Order-Operator connection in the database.
I want a unified fact table where, for every timestamp, I know the MachineID, OrderID and OperatorID. If a timestamp exists in DB1, the record will have the numeric measures from it (consumables); if it exists in DB2, it will have the numeric measures from DB2 (good/bad production items); if it exists in both databases, it will have all the numeric measures. A simple UNION ALL is not enough, because I want MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure that performs a FULL JOIN by timestamp and MachineID, but on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge the full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do this with SQL Server stored procedures run in parallel by SQL Agent with different parameters, but I found that it didn't help performance. Instead, it created multiple deadlocks when updating the staging and final tables.
I am looking for an alternative way to solve this problem, but I don't know what the right tool is. Can Hadoop or a similar parallel processing tool help with this task?
I am looking for a solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and Power BI reporting work just fine for me.
Hadoop seems hard to justify in this use case, given the limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but also thanks to parallel I/O, when data is distributed across multiple servers/storage media. Unless you are happy to copy all the data to HDFS distributed among multiple nodes, it likely will not help much. If you spin up a Hadoop cluster and run multiple jobs querying a single SQL Server, it will likely end up badly for the latter.
Have you considered optimizations that would let you limit the amount of data you process nightly?
E.g., what is the 'timestamp' field? Does it reflect the last update time? Can you use it to filter out rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of the last update, can you add an "updateTime" column and an update trigger that populates it, so you don't need to import rows which have not changed since the previous run? If you build an index on that column and the number of updates during the day is not high relative to the total table size, a query filtering on it will hit the index and fetching the incremental changes should be fast.
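A minimal sketch of that idea; the table and column names are illustrative assumptions:

    -- Add a last-update column with a default for newly inserted rows.
    ALTER TABLE dbo.Production ADD UpdateTime DATETIME2 NOT NULL
        CONSTRAINT DF_Production_UpdateTime DEFAULT (SYSUTCDATETIME());
    GO

    -- Keep it current on updates (recursive triggers are off by default, so this is safe).
    CREATE TRIGGER trg_Production_UpdateTime ON dbo.Production
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE p
        SET    p.UpdateTime = SYSUTCDATETIME()
        FROM   dbo.Production AS p
        JOIN   inserted AS i ON i.ProductionID = p.ProductionID;
    END;
    GO

    CREATE NONCLUSTERED INDEX IX_Production_UpdateTime ON dbo.Production (UpdateTime);
    GO

    -- The nightly load then only pulls rows changed since the previous run.
    SELECT *
    FROM   dbo.Production
    WHERE  UpdateTime > @LastLoadTime;  -- @LastLoadTime saved from the previous load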
Another thing to consider: are those DBs running on the same node/SQL Server instance? Access to remote DBs is slow, so if they are not, think about how to fix that first.

Index/Statistics on volatile tables

One of my application has the following use-case:
user inputs some filters and conditions about orders (delivery date ranges,...) to analyze
the application computes a lot of data and saves it in several support tables (potentially thousands of records for each analysis)
the application starts a report engine that uses data from these tables
when exiting, the application deletes the computed records from the support tables
I'm currently analyzing how to enhance query performance by adding indexes/statistics to the support tables, and SQL Profiler suggests I create 3-4 indexes and 20-25 statistics.
The records in the support tables are constantly created and removed: is it correct to create all these indexes/statistics, or is there a risk that they will quickly become outdated (with the only result being constant overhead for maintaining them)?
DB server: SQL Server 2005+
App language: C# .NET
Thanks in advance for any hints/suggestions!
First, this seems like a good situation for a data cube. Second, yes, you should update statistics once the support tables are populated and before running your queries. You should disable your indexes when inserting the data; the rebuild command will then bring your indexes and stats up to date in one go. Profiler these days is usually quite good at these suggestions, but test the combinations to see what actually gives the best performance gains. For open-source cubes, have a look here: What are the open source tools and techniques to build a complete data warehouse platform?
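A minimal sketch of the disable/load/rebuild pattern; the object names are illustrative, and note that only nonclustered indexes should be disabled (disabling the clustered index makes the table unreadable):

    ALTER INDEX IX_Support_Filter ON dbo.SupportResults DISABLE;

    -- ... insert the computed records for this analysis here ...

    -- Rebuilding re-enables the index and refreshes its statistics in one step.
    ALTER INDEX IX_Support_Filter ON dbo.SupportResults REBUILD;

    -- Optionally refresh the remaining column statistics on the table as well.
    UPDATE STATISTICS dbo.SupportResults;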

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so that the data with one group key is totally independent from the data with other group keys. Reports are built using only one group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively by parallel processes which add data into different groups, but I don't need real-time reports; one update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I have found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is reduce the time it takes to create a report without significantly changing the technologies or the schema of the solution.
Possible solutions
What I was thinking about this is:
Splitting the one database into several databases, one per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table) and the number of rows to process each time would be significantly smaller.
Creating a read-only slave database. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes (can I actually do that, or do I have to write custom code?).
I don't want to change the database to NoSQL because I would have to rewrite all the SQL queries, and I don't want to do that. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process, a db function (or script) run via your job scheduler of choice, that refreshes the data every N minutes (a sketch follows after point 2).
2). Regarding replication: if you are using streaming replication (PostgreSQL 9+), changes in the master db are replicated to the slave databases (hot standby = read only), which you can then use for reporting.
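One way to implement such a summary table in PostgreSQL is a materialized view refreshed on a schedule; a minimal sketch with illustrative table and column names:

    -- Precompute counts and sums per group key and day.
    CREATE MATERIALIZED VIEW order_summary AS
    SELECT group_key,
           date_trunc('day', order_date) AS order_day,
           count(*)    AS order_count,
           sum(amount) AS total_amount
    FROM   orders
    GROUP  BY group_key, date_trunc('day', order_date);

    CREATE INDEX ON order_summary (group_key);

    -- Run from your scheduler of choice every N minutes.
    REFRESH MATERIALIZED VIEW order_summary;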
Tune the report query. Use EXPLAIN (see the example below). Avoid procedures when you could do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade the Postgres version.
Run VACUUM.
Out of these 4, only the first will require significant changes in the application.
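A minimal example of the EXPLAIN suggestion, with illustrative table and column names:

    -- ANALYZE executes the query and reports actual timings; BUFFERS shows I/O per node.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT r.group_key, sum(r.amount)
    FROM   report_lines AS r
    JOIN   orders AS o ON o.order_id = r.order_id
    WHERE  r.group_key = 42
    GROUP  BY r.group_key;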

Rolling database - SQL Server 2008

I am trying to come up with an archiving solution and would like to implement the following architecture:
Main Table - kept small.
Copy Job - takes 3 months' worth of data and copies it into an archive table.
When the archive table reaches a certain number of records, the job creates a new table; thus the database keeps rolling, accumulating approximately a calendar year's worth of records.
My questions are:
Are there any ready solutions I can refer to?
Common design practices to execute on?
For SQL Server 2005+, take a look at Partitioned Tables and Indexes in SQL Server 2005, especially the Sliding-Window Scenario portion of the article.
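A minimal sketch of the sliding-window mechanics on a partitioned table; the object names, partition number, and boundary dates are illustrative assumptions:

    -- Switch the oldest partition out to an empty archive table with an identical
    -- structure on the same filegroup (a metadata-only operation).
    ALTER TABLE dbo.MainData SWITCH PARTITION 1 TO dbo.MainData_Archive;

    -- Remove the now-empty boundary and add a new one for the next period.
    ALTER PARTITION FUNCTION pfMainData() MERGE RANGE ('2023-01-01');
    ALTER PARTITION SCHEME psMainData NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfMainData() SPLIT RANGE ('2024-04-01');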
