SQL Server 2008 R2 Distributed Partition View Update/Delete issue - sql-server

I have a large table for storing articles (more than 500 million records), so I use the distributed partitioned view feature of SQL Server 2008 across 3 servers.
Select and Insert operations work fine, but Delete and Update operations take a long time and never complete.
In the Processes tab of Activity Monitor, I see that the Wait Type field is "PREEMPTIVE_OLEDBOPS" for the Update command.
Any idea what the problem is?
Note: I think the problem is with MSDTC, because the Update command is not shown in SQL Profiler on the second server, but when I check the MSDTC status on that server, the status column is Update (active).

What is most likely happening is that all the data from the other servers is pulled over to the machine where the query is running before the filter of your UPDATE statement is applied. This can happen when you use 4-part naming. Possible solutions are:
Make sure each member table has a correct CHECK constraint that defines the minimum and maximum value of the partitioning column for that table (a sketch is shown below). Without this, partition elimination will not work properly.
Call a stored procedure with 4-part naming on the other server to do the update.
Use OPENQUERY() to connect to the other server.
To serve 500 million records, a single server seems adequate; a setup with table partitioning and a sliding window is probably a more cost-effective way of handling this volume.
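As a rough sketch (server, database, table, and range values below are made up), each member table carries a CHECK constraint on the partitioning column, and the view unions the members with 4-part names:
-- Member table on server 1; the CHECK constraint on the partitioning column
-- (which is part of the primary key) is what allows partition elimination.
CREATE TABLE dbo.Articles_1 (
    ArticleID INT NOT NULL,
    Title     NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_Articles_1 PRIMARY KEY (ArticleID),
    CONSTRAINT CK_Articles_1_Range CHECK (ArticleID BETWEEN 1 AND 200000000)
);
-- Distributed partitioned view as created on server 1; the other servers
-- get the same view with their own local member referenced directly.
CREATE VIEW dbo.Articles AS
    SELECT * FROM ArticlesDB.dbo.Articles_1              -- local member
    UNION ALL
    SELECT * FROM Server2.ArticlesDB.dbo.Articles_2      -- remote members via linked servers
    UNION ALL
    SELECT * FROM Server3.ArticlesDB.dbo.Articles_3;
With non-overlapping ranges in the CHECK constraints, an UPDATE or DELETE that filters on ArticleID only touches the members that can actually contain the rows.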

Related

Most efficient way to load large table+data from Oracle11G into SQL Server 2012 (I have a linked server setup)

I have two databases, one is SQL Server 2012 and the other one is an Oracle 11G database.
I need to copy a large table with 100 million records from Oracle into the SQL Server database.
What is the most efficient way to copy those 100 million records into SQL Server?
I have a linked server setup in my SQL Server which is pointing at the Oracle Database.
Consider the following points before planning the data movement.
CREATE the target TABLE with the right data type definitions so that you avoid I/O issues while loading data. Define the smallest data type that suits your needs. Do data profiling on the Oracle side to decide how the columns should be defined: what the maximum length of each column is and what is actually being used on the Oracle side.
For example, if the column is an age, then tinyint is enough (it supports 0-255).
The target SQL Server table should not have any PK or FK constraints enabled. Disable all constraints before loading the data. You can also consider dropping and recreating the PK and FK constraints.
The target SQL Server table should not have any indexes enabled. Disable all indexes before loading the data. You can also consider dropping and recreating the indexes.
Do the insertion in batches. A huge INSERT run as a single batch can fail for many reasons: a sudden network problem, a sudden burst of log writes, a sudden increase in memory footprint, and so on. If you have a timestamp column or a unique identifier column on the Oracle side, filter on it and pull the data in batches. Make the batch boundary (@Id below) configurable.
DECLARE @Id INT = 1000000;                                    -- batch size, keep this configurable
SELECT * FROM OracleTable WHERE Id < @Id;                     -- first batch
SELECT * FROM OracleTable WHERE Id >= @Id AND Id < 2 * @Id;   -- second batch, and so on
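For illustration, assuming a linked server named ORACLE_LINK and an Oracle table SRC.INVOICES with a numeric ID column (all of these names are made up), one batch can be pulled like this:
-- The WHERE clause is evaluated on the Oracle side because it is part of the
-- OPENQUERY text; OPENQUERY does not accept variables, so the batch boundaries
-- are either literals or built with dynamic SQL.
INSERT INTO dbo.Invoices_Staging WITH (TABLOCK)   -- TABLOCK helps toward minimal logging
SELECT *
FROM OPENQUERY(ORACLE_LINK,
    'SELECT * FROM SRC.INVOICES WHERE ID >= 0 AND ID < 1000000');
-- next batch: ID >= 1000000 AND ID < 2000000, and so on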
Try to go for 64-bit Oracle drivers for the linked server. They give better performance than 32-bit drivers.
Oracle 64 bit driver linked server
Schedule this operation at a less busy time so that more resources are available to you on the SQL Server side.
Do a small test with a small amount of data to check that everything works before going for the full-fledged run, to avoid intermittent issues.
Check whether parallelism is enabled on the SQL Server side so you can take advantage of parallel insert. Max degree of parallelism
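As a quick check, the instance-wide setting can be read with sp_configure:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';   -- run_value of 0 means use all available schedulers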

How can I improve cardinality estimates on staging tables?

I support a process that runs every night and looks at various clients that have invoices with unpaid line items. The process starts by deleting all records from a staging table and then inserting a number of invoice line items into the staging table. The process runs on a per-client basis, so some clients may have 200 line items and some may have 50,000. We are constantly having issues with the process running an exorbitant amount of time. The issue seems to stem from SQL Server's inability to estimate the correct number of rows in the staging table at the time, so it generates a bad execution plan. My question is: is there a way to manually set the estimated number of rows to improve cardinality estimates for the stored procedures involved? Perhaps this could be done through a SELECT COUNT(primaryKey) at the beginning of the run, right after the current run's staging table is populated?
You are executing big batch processes on this table. It's a good approach to drop all indexes before your batch and create them again after the batch.
If you do this, your statistics will be rebuilt and won't be the cause of your problem.
Pay heed also to more generic information about statistics: statistics update behavior changed a lot between SQL Server 2014 and SQL Server 2016. If you are running SQL Server 2016, you need to check whether your database is using the new cardinality estimator; just check whether your database is running at the SQL Server 2016 compatibility level.
If you are running SQL Server 2014, a good option is to enable trace flag 2371. This trace flag improves the criteria SQL Server uses to automatically update statistics. You can enable it as a startup parameter through SQL Server Configuration Manager.
However, if you follow the first suggestion, dropping and recreating the indexes, the other two suggestions will have little or no impact.
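A rough sketch of the first suggestion (the staging table, index, and column names below are made up):
-- Before the batch: drop the nonclustered index on the staging table.
DROP INDEX IX_Staging_ClientID ON dbo.InvoiceLineStaging;

-- ... delete the previous run's rows and insert the current client's line items ...

-- After the batch: recreate the index, which also builds fresh statistics.
CREATE NONCLUSTERED INDEX IX_Staging_ClientID
    ON dbo.InvoiceLineStaging (ClientID) INCLUDE (InvoiceID, Amount);

-- Alternatively, refresh statistics explicitly after the load so the next
-- statements compile against the real row count.
UPDATE STATISTICS dbo.InvoiceLineStaging WITH FULLSCAN;
Adding OPTION (RECOMPILE) to the statements that read the staging table is another way to get a plan based on the row count of the current run, at the cost of a compilation per execution.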

Database Engine Update Logic

When a record is updated in a SQL Server table, how does the database engine physically execute such a request: is it an INSERT + DELETE pair or an in-place UPDATE operation?
As we know, the performance of a database and any statements depends on many variables. But I would like to know if some things can be generalized.
Is there a threshold (table size, query length, # records affected...) after which the database switches to one approach or the other upon UPDATEs?
If there are times when SQL Server is physically performing an insert/delete when a logical update is requested, is there a system view or metric that would show this? I.e., if there is a running total of all the inserts, updates, and deletes that the database engine has performed since it was started, then I would be able to figure out how the database behaves after I issue a single UPDATE.
Is there any difference in the UPDATE statement's behavior depending on the SQL Server version (2008, 2012...)?
Many thanks.
Peter
An UPDATE on a base table without triggers is always a physical UPDATE; SQL Server has no such threshold. You can look up usage statistics, for example, in sys.dm_db_index_usage_stats.
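For example, a query along these lines shows the cumulative counters (note that user_updates lumps INSERT, UPDATE, and DELETE together, and counts statements rather than rows):
-- Cumulative counts since the instance last started, for the current database.
SELECT OBJECT_NAME(s.object_id)  AS table_name,
       i.name                    AS index_name,
       s.user_updates,           -- INSERT, UPDATE and DELETE statements against the index
       s.user_seeks, s.user_scans, s.user_lookups
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
     ON i.object_id = s.object_id
    AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID();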
UPDATE edits the existing row. If it were an insert/delete, then you'd get update failures for duplicate keys.
INSERT, UPDATE, and DELETE can also each be permissioned separately, so a user could be allowed to update records but not to insert or delete them, which again points to an update not being implemented as an insert plus a delete.

SQL Server high volume update on one field

I am building a VoIP switch and I am going to be doing an insert using a SQL stored procedure.
I need to update the user table's "balance" field each time I insert into the history table. Because it is a switch, I can have hundreds of updates each second.
I want to know the best way to update the field without deadlocks and without incorrect balances.
I will be using MS SQL Server 2012.
Partition the user table into evenly sized partitions - SQL Server 2012 allows 15,000 of them. That way the updates are distributed over many allocation units instead of just one. Then add the WITH (ROWLOCK) hint to the update query.
To kick off the actual update you could use a trigger.
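As a rough sketch (procedure, table, and column names are made up), the history insert and the balance update can go in one short transaction, with the balance changed by a single atomic statement:
CREATE PROCEDURE dbo.usp_RecordCall
    @UserId INT,
    @Cost   DECIMAL(10, 4)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    INSERT INTO dbo.CallHistory (UserId, Cost, CallTime)
    VALUES (@UserId, @Cost, SYSDATETIME());

    -- Read and decrement the balance in one statement under a row lock,
    -- so concurrent calls for different users do not block each other.
    UPDATE dbo.Users WITH (ROWLOCK)
    SET Balance = Balance - @Cost
    WHERE UserId = @UserId;

    COMMIT TRANSACTION;
END;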

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries and to know which database each row came from, we will add a single DatabaseID column to each target table.
Editing the source tables is not an option; they belong to proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program that opens a connection to the SQL Server where we want the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each row to say which DatabaseID it came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to:
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
Nothing fancy but couldn't you do something like
TRUNCATE TABLE dbo.Merged
INSERT INTO dbo.Merged
SELECT [DatabaseID] = 'Database1', * FROM ServerA.Database1.dbo.Table1
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.Database2.dbo.Table1
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.DatabaseX.dbo.Table1
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
No reliable logging/alerting if something fails
We had a similar requirement, but we took a different approach. We first created a central database to collect the data. Then we created an inventory table to store the list of target servers/databases. Then we wrote a small VB.NET-based CLR procedure which takes the SQL query, the target SQL instance name, and the target table which will store the data (this eliminates the need to set up a linked server when new targets are added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate.
The above CLR procedure is wrapped in another procedure which dequeues the message and executes the SQL on the target server provided. The wrapper procedure is then configured as the activation procedure for the queue.
With this we are able to achieve a degree of parallelism when capturing the data.
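For reference, the Service Broker plumbing for such a setup might look roughly like this (all object names are made up, and dbo.usp_ProcessCollectRequest stands for the wrapper procedure described above):
-- Queue that holds one message per target server/database to interrogate.
CREATE QUEUE dbo.TargetQueue;

CREATE MESSAGE TYPE CollectRequest VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT CollectContract (CollectRequest SENT BY INITIATOR);
CREATE SERVICE CollectService ON QUEUE dbo.TargetQueue (CollectContract);

-- The wrapper procedure that RECEIVEs a message and calls the CLR procedure
-- is attached as the activation procedure; Service Broker starts up to
-- MAX_QUEUE_READERS copies of it in parallel as messages arrive.
ALTER QUEUE dbo.TargetQueue WITH ACTIVATION (
    STATUS            = ON,
    PROCEDURE_NAME    = dbo.usp_ProcessCollectRequest,
    MAX_QUEUE_READERS = 4,
    EXECUTE AS OWNER);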
Advantages:
Easy to set up
Easy to manage (add/remove targets)
Same framework works for multiple queries
Logging tables to check for failed queries
Works independently of each target, so if one of the targets fails to respond, the others still continue
The workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and collection can be resumed by re-enabling it
Disadvantages:
Requires a good understanding of Service Broker.
Poison messages should be handled properly.
Please let me know if this helps.
