SQL Server Complex Add/Update merge million rows daily - sql-server

I have 7 reports which are downloaded daily at late night.
These reports can be downloaded as CSV or XML. I download them in CSV format because it is more memory efficient.
This process runs in background and is managed by hangfire.
After they are downloaded, I use Dapper to run a stored procedure which inserts, updates, or deletes data using MERGE statements. This stored procedure has seven table-valued parameters.
Instead of delete, I am updating that record's IsActive column to false.
Note that 2 reports have more than 1 million records.
I am getting timeout exceptions only in Azure SQL. In SQL Server, it works fine. As a workaround, I have increased timeouts to 1000 for this query.
This app is running in Azure s2.
I have considered sending XML instead, but I have found that SQL Server is slow at processing XML, which is counterproductive.
I also cannot use SqlBulkCopy, because I have to update rows based on some conditions.
Also note that more reports will be added in future.
Also, when a new report is added there is a large number of inserts. If a previously added report is run again, the majority of operations are updates.
These tables currently have no indexes other than a clustered integer primary key.
Each row has a unique code. This code is used to decide whether to insert, update, or delete.
Can you recommend a way to increase performance?

Does your source deliver the whole data set each time, regardless of whether rows are new or updated? I assume that by using the unique code to decide between insert/update/delete you are only considering changes (the delta). If not, that is one area to improve. Another is parallelism: for that, I think you need a separate stored procedure for each table, so that tables that do not depend on each other can be processed at the same time.
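To illustrate the delta idea, here is a minimal sketch that loads a report into a staging table and then touches only the rows that actually changed. It uses Python's built-in sqlite3 purely as a runnable stand-in for SQL Server, and the table and column names (report, staging, code, amount, is_active) are hypothetical. On SQL Server the three statements would correspond roughly to the branches of a MERGE, and the unique index on the code column is what lets each lookup avoid a full scan.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE report (
            id INTEGER PRIMARY KEY,
            code TEXT NOT NULL,
            amount REAL NOT NULL,
            is_active INTEGER NOT NULL DEFAULT 1
        );
        -- The unique code is the match key, so it gets its own index.
        CREATE UNIQUE INDEX ux_report_code ON report (code);
        CREATE TABLE staging (code TEXT PRIMARY KEY, amount REAL NOT NULL);
    """)
    return conn

def apply_report(conn, rows):
    """Apply one downloaded report, given as (code, amount) pairs, as a delta."""
    cur = conn.cursor()
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging (code, amount) VALUES (?, ?)", rows)

    # 1. Update only rows whose value differs or that were soft-deleted earlier.
    cur.execute("""
        UPDATE report
        SET amount = (SELECT s.amount FROM staging s WHERE s.code = report.code),
            is_active = 1
        WHERE EXISTS (
            SELECT 1 FROM staging s
            WHERE s.code = report.code
              AND (s.amount <> report.amount OR report.is_active = 0))
    """)
    updated = cur.rowcount

    # 2. Insert codes that are not in the target yet.
    cur.execute("""
        INSERT INTO report (code, amount, is_active)
        SELECT s.code, s.amount, 1 FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM report r WHERE r.code = s.code)
    """)
    inserted = cur.rowcount

    # 3. Soft-delete active rows that are missing from the new report.
    cur.execute("""
        UPDATE report SET is_active = 0
        WHERE is_active = 1
          AND NOT EXISTS (SELECT 1 FROM staging s WHERE s.code = report.code)
    """)
    deactivated = cur.rowcount

    conn.commit()
    return updated, inserted, deactivated
```

Because unchanged rows are filtered out in step 1, re-running a report that is mostly identical to the previous one writes far fewer rows than an unconditional update would.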

Related

Can I use Hadoop to speed up a slow SQL stored procedure?

The problem:
I have 2 SQL Server databases from 2 different applications. They describe different aspects of industrial machines: one is about "how many consumables were spent per order", the other is about "how many good/bad production items were produced per operator". Sometimes many operators are working on 1 order one after another, sometimes one operator is working on multiple small orders, and there is no connection Order-Operator in the database.
I want to have a unified fact table, where for every timestamp I know MachineID, OrderID and OperatorID. If a timestamp exists in DB1, then the record will have numeric measures from it (consumables); if it exists in DB2, then it will have numeric measures from DB2 (good/bad production items). If it exists in both databases, then it will have all numeric measures. A simple UNION ALL is not enough, because I want to have MachineID, OrderID and OperatorID for every record.
I created a T-SQL stored procedure that does a FULL JOIN by timestamp and MachineID. But on large data sets (multiple machines, multiple customers) it becomes very slow. Both applications support editing history, so I need to merge the full history from both databases at every nightly load.
To speed up the process, I would like to put calculations into multiple parallel threads, separated by Customer, MachineID, and Year.
I tried to do it by using SQL Server stored procedures, running in parallel by SQL Agent with different parameters, but I found that it didn't help the performance. Instead it created multiple deadlocks when updating staging and final tables.
I am looking for an alternative way to resolve this problem, but I don't know what is the right tool. Can Hadoop or similar parallel processing tool help with this task?
I am looking for solution with minimal cost, because it is needed for just one specific task. For everything else, SQL Server and PowerBI reporting are working just fine for me.
Hadoop seems hard to justify in this use case, given the limited scope. The thing about Hadoop is that it scales well not only due to parallel processing but also thanks to parallel IO, when data is distributed across multiple servers/storage media. Unless you are happy to copy all the data to HDFS, distributed among multiple nodes, it likely will not help much. If you spin up a Hadoop cluster and run multiple jobs querying a single SQL Server, it will likely end badly for the latter.
Have you considered optimizations that would let you limit the amount of data you are processing nightly?
E.g. what is 'timestamp' field? Does it reflect last update time? Can you use it to filter rows which haven't been updated since the previous run?
Even if the 'timestamp' is not the time of last updates, can you add an "updateTime" field and triggers on updates which will populate the field, so you don't need to import rows which have not changed since the previous run? If you build an index on the field, then, if the number of updates during the day is not high relative to total table size, a query with a filter on such field will hit the index, and fetching of incremental changes should be fast.
Another thing to consider - are those DBs running on the same node/SQL server? Access to remote DBs is slow, so if that's the case, think about how to fix this first.
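As a rough sketch of the "updateTime plus trigger plus index" suggestion, the example below keeps a change marker on each row and fetches only rows changed since the previous run. It uses Python's sqlite3 as a stand-in for SQL Server, and every name in it (measurements, update_tick, clock, fetch_changes) is hypothetical. To keep the result deterministic it uses a logical counter instead of a wall-clock time; a real updateTime column would hold a datetime instead.

```python
import sqlite3

def make_source():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Logical clock standing in for "current time".
        CREATE TABLE clock (tick INTEGER NOT NULL);
        INSERT INTO clock (tick) VALUES (0);

        CREATE TABLE measurements (
            id INTEGER PRIMARY KEY,
            value REAL NOT NULL,
            update_tick INTEGER NOT NULL DEFAULT 0
        );
        -- Index on the change marker so the incremental filter can seek.
        CREATE INDEX ix_measurements_update_tick ON measurements (update_tick);

        CREATE TRIGGER trg_measurements_insert AFTER INSERT ON measurements
        BEGIN
            UPDATE clock SET tick = tick + 1;
            UPDATE measurements SET update_tick = (SELECT tick FROM clock)
            WHERE id = NEW.id;
        END;

        CREATE TRIGGER trg_measurements_update AFTER UPDATE OF value ON measurements
        BEGIN
            UPDATE clock SET tick = tick + 1;
            UPDATE measurements SET update_tick = (SELECT tick FROM clock)
            WHERE id = NEW.id;
        END;
    """)
    return conn

def fetch_changes(conn, last_tick):
    """Return rows changed after last_tick, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, value, update_tick FROM measurements "
        "WHERE update_tick > ? ORDER BY update_tick",
        (last_tick,),
    ).fetchall()
    new_tick = rows[-1][2] if rows else last_tick
    return [(r[0], r[1]) for r in rows], new_tick
```

The nightly job would store the returned high-water mark and pass it in on the next run, so rows that have not changed are never read.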

handling performance issues in SSIS

I have billions of records to transmit via SSMS and am looking at improving the migration speed. I am trying to save the result set into a table. I am taking into account that my front end would be slow if all that data were in one table, so I am looking at various options and am thinking of having multiple physical tables. I just need the last 5 years of data. Will it be faster if I execute 5 different versions of the same stored procedure with different year filters and populate the tables? I know one can achieve parallelism in SSIS. My only fear is that, since all five stored procedures are running in parallel, they will lock the tables.
You could look into table partitioning schemes in sql server. It seems like the Year column would be a good field to use in your partitioning function.
Table Partitioning in SQL Server

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data for one group key is totally independent from the data for other group keys. Reports are built using only 1 group key at a time, so all of the queries use a "WHERE groupKey = X;" clause. The data in PostgreSQL is updated intensively by parallel processes which add data to different groups, but I don't need real-time reports. One update per 30 minutes is fine.
Problem:
There are about 4 GB of data already and I found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about this is:
Splitting one database into several databases, one database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table) and the number of rows to process each time would be significantly smaller.
Creating a read-only slave database. Then I would have to sync the data with the replication mechanism of PostgreSQL, for example once per 15 minutes. (Can I actually do that, or do I have to write custom code?)
I don't want to change the database to NoSQL because I will have to rewrite all sql queries and I don't want to. I might switch to another SQL database with column store support if it is free and runs on Windows (sorry, don't have Linux server but might have one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries that your users are likely to run. Eg. A table containing the counts and sums grouped by the various dimensions. This can be an automated process -- a db function (or script) gets run via your job scheduler of choice -- that refreshes the data every N minutes.
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.
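A minimal sketch of the summary-table idea in point 1, using Python's sqlite3 as a stand-in for PostgreSQL. The table and column names (events, report_summary, group_key, category, amount) are hypothetical, and refresh_summary is the function a job scheduler would call every N minutes.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (
            group_key INTEGER NOT NULL,
            category TEXT NOT NULL,
            amount REAL NOT NULL
        );
        -- Precomputed counts and sums per group key and category.
        CREATE TABLE report_summary (
            group_key INTEGER NOT NULL,
            category TEXT NOT NULL,
            row_count INTEGER NOT NULL,
            total_amount REAL NOT NULL,
            PRIMARY KEY (group_key, category)
        );
    """)
    return conn

def refresh_summary(conn):
    """Rebuild the summary in one transaction so readers never see it half-built."""
    with conn:
        conn.execute("DELETE FROM report_summary")
        conn.execute("""
            INSERT INTO report_summary (group_key, category, row_count, total_amount)
            SELECT group_key, category, COUNT(*), SUM(amount)
            FROM events
            GROUP BY group_key, category
        """)

def report(conn, group_key):
    """Reports read the small summary table instead of the detail tables."""
    return conn.execute(
        "SELECT category, row_count, total_amount FROM report_summary "
        "WHERE group_key = ? ORDER BY category",
        (group_key,),
    ).fetchall()
```

The trade-off is staleness: a report only reflects data up to the last refresh, which fits the stated tolerance of one update per 30 minutes.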
Tune the report query. Use explain. Avoid procedure when you could do it in pure sql.
Tune the server; memory, disk, processor. Take a look at server config.
Upgrade postgres version.
Do vacuum.
Out of 4, only 1 will require significant changes in the application.

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries and to know which database each row came from, we will add a single column, DatabaseID, to each target table.
Editing the source tables is not an option, it belongs to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then to run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program opening a connection to the SQL Server where we would like the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each table to say which DatabaseID the row came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to:
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
Nothing fancy but couldn't you do something like
TRUNCATE TABLE dbo.Merged
INSERT INTO dbo.Merged
SELECT [DatabaseID] = 'Database1', * FROM ServerA.Database1.dbo.[Table]
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.Database2.dbo.[Table]
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.DatabaseX.dbo.[Table]
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
Reliable logging
We had a similar requirement where we took a different approach. We first created a central database to collect the data. Then we created an inventory table to store the list of target servers / databases. Then we wrote a small VB.NET based CLR procedure which takes the path of a SQL query, the target SQL instance name and the target table which will store the data (this eliminates the need to set up a linked server when new targets are added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate.
The above CLR procedure is wrapped in another procedure which dequeues a message and executes the SQL on the target server provided. The wrapper procedure is then configured as the activated procedure for the queue.
With this we are able to achieve a bit of parallelism to capture the data.
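The shape of that workflow (a list of targets, independent workers, two extra columns, failures recorded without stopping the rest) can be sketched outside SQL Server as well. The example below is not Service Broker; it is a plain Python thread pool used only to illustrate the idea, and collect and fetch_rows are hypothetical names. fetch_rows stands for whatever runs the query against one target.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timezone

def collect(targets, fetch_rows, max_workers=4):
    """Run fetch_rows(target) for every target in parallel.

    Returns (rows, failures). Each row is prefixed with the target name and
    the capture time; a failing target is recorded and the others continue.
    """
    rows, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_rows, t): t for t in targets}
        for fut in as_completed(futures):
            target = futures[fut]
            try:
                result = fut.result()
            except Exception as exc:
                # Log the failure instead of aborting the whole collection.
                failures.append((target, str(exc)))
                continue
            captured_at = datetime.now(timezone.utc)
            rows.extend((target, captured_at) + tuple(r) for r in result)
    return rows, failures
```

In a real setup the failures list would be written to a logging table, and the worker count would be kept low enough not to overload the central server.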
Advantages:
Easy to set up
Easy to manage (add / remove targets)
Same framework works for multiple queries
Logging tables to check for failed queries
Each target is processed independently, so if one target fails to respond, the others still continue
Workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and resumed by re-enabling it
Disadvantages:
Requires a good understanding of Service Broker
Poison messages must be handled properly
Please let me know if it helps.

SpeedUp Database Updates

There is a SQL Server 2000 database we have to update during a weekend.
Its size is almost 10 GB.
The updates range from schema changes and primary key updates to some million records being updated, corrected or inserted.
The weekend is hardly enough for the job.
We set up a dedicated server for the job,
turned the Database SINGLE_USER
made any optimizations we could think of: drop/recreate indexes, relations etc.
Can you propose anything to speedup the process?
SQL Server 2000 is not negotiable (not my decision). Updates are run through a custom-made program and not BULK INSERT.
EDIT:
Schema updates are done by Query analyzer TSQL scripts (one script per Version update)
Data updates are done by C# .net 3.5 app.
Data come from a bunch of Text files (with many problems) and written to local DB.
The computer is not connected to any Network.
Although dropping excess indexes may help, you need to make sure that you keep those indexes that will enable your upgrade script to easily find those rows that it needs to update.
Otherwise, make sure you have plenty of memory in the server (although SQL Server 2000 Standard is limited to 2 GB), and if need be pre-grow your MDF and LDF files to cope with any growth.
If possible, your custom program should be processing updates as sets instead of row by row.
EDIT:
Ideally, try and identify which operation is causing the poor performance. If it's the schema changes, it could be because you're making a column larger and causing a lot of page splits to occur. However, page splits can also happen when inserting and updating for the same reason - the row won't fit on the page anymore.
If your C# application is the bottleneck, could you run the changes first into a staging table (before your maintenance window), and then perform a single update onto the actual tables? A single update of 1 million rows will be more efficient than an application making 1 million update calls. Admittedly, if you need to do this this weekend, you might not have a lot of time to set this up.
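Here is a small sketch of the staging-table approach, again with Python's sqlite3 standing in for SQL Server and with hypothetical names (items, items_staging, load_staging, apply_staging). The point is the split: loading the staging table can happen ahead of time, and the real table is changed by one set-based statement.

```python
import sqlite3

def make_db(n):
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL NOT NULL);
        CREATE TABLE items_staging (id INTEGER PRIMARY KEY, price REAL NOT NULL);
    """)
    conn.executemany(
        "INSERT INTO items (id, price) VALUES (?, ?)",
        [(i, 1.0) for i in range(n)],
    )
    conn.commit()
    return conn

def load_staging(conn, corrections):
    """Can run before the maintenance window: only the staging table is touched."""
    with conn:
        conn.execute("DELETE FROM items_staging")
        conn.executemany(
            "INSERT INTO items_staging (id, price) VALUES (?, ?)", corrections
        )

def apply_staging(conn):
    """One set-based UPDATE during the window instead of one call per row."""
    with conn:
        cur = conn.execute("""
            UPDATE items
            SET price = (SELECT s.price FROM items_staging s WHERE s.id = items.id)
            WHERE id IN (SELECT id FROM items_staging)
        """)
    return cur.rowcount
```

Cleaning up the problematic text files would also happen while loading the staging table, so bad input is caught before the real tables are touched.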
What exactly does this "custom made program" look like? i.e. how is it talking to the data? Minimising the amount of network IO (from a db server to an app) would be a good start... typically this might mean doing a lot of work in TSQL, but even just running the app on the db server might help a bit...
If the app is re-writing large chunks of data, it might still be able to use bulk insert to submit the new table data. Either via command-line (bcp etc), or through code (SqlBulkCopy in .NET). This will typically be quicker than individual inserts etc.
But it really depends on this "custom made program".
