I have a SQL Server table of 200 million records that exists on two different servers. I need to move this table from Server 1 to Server 2.
The table on Server 1 can be a subset or a superset of the table on Server 2. Some of the records (around 1 million) on Server 1 have been updated, and I need to apply those updates on Server 2. So currently I am following this approach:
1) Use SSIS to move the data from Server 1 to a staging database on Server 2.
2) Then compare the data in staging with the table on Server 2, column by column. If any column differs, I update the whole row.
This is taking a lot of time. I tried using HASHBYTES in order to compare rows, like this:
HASHBYTES('sha', CONCAT(a.[account_no], a.[transaction_id], ...))
<>
HASHBYTES('sha', CONCAT(b.[account_no], b.[transaction_id], ...))
But this is taking even more time.
Is there any other approach that would be faster and save time?
This is a problem that's pretty common.
First - do not try to do the updates directly in SQL - the performance will be terrible, and it will bring the database server to its knees.
For context: TS1 will be the table on Server 1, and TS2 the table on Server 2.
Using SSIS - create two steps within the job:
First, find the deleted records: scan TS2 by ID, and delete any TS2 row whose ID does not exist in TS1.
Second, scan TS1, and if the ID exists in TS2, update that record. If memory serves, SSIS can inspect for differences and update only when needed; otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, insert the record.
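For illustration, once step 1 has landed the Server 1 data in a staging table on Server 2 (so everything below runs locally on Server 2), the set-based equivalent of those steps might look like the following sketch. The staging table name dbo.Staging_TS1, the ID key, and the column names are placeholders of mine, not from the original:

-- 1) Delete TS2 rows whose ID no longer exists in TS1's staged copy
DELETE t2
FROM dbo.TS2 AS t2
WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging_TS1 AS s WHERE s.ID = t2.ID);

-- 2) Update rows that exist on both sides (ideally only those that differ)
UPDATE t2
SET t2.account_no     = s.account_no,
    t2.transaction_id = s.transaction_id   -- ...and the remaining columns
FROM dbo.TS2 AS t2
INNER JOIN dbo.Staging_TS1 AS s ON s.ID = t2.ID;

-- 3) Insert rows that exist only in the staged copy
INSERT INTO dbo.TS2 (ID, account_no, transaction_id)   -- ...and the remaining columns
SELECT s.ID, s.account_no, s.transaction_id
FROM dbo.Staging_TS1 AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.TS2 AS t2 WHERE t2.ID = s.ID);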
I can't speak to performance on this due to variations in schemas and servers, but it will be compute-intensive to analyze the 200 million records. It WILL take a long time.
For ongoing execution, you will need to add a "last modified date" timestamp to each record and a trigger to update that field on any legitimate change. Then use it to filter down your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last-modified-date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index on that column to aid in the filtering.
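A minimal sketch of that bookkeeping, assuming an ID primary key and with the column, constraint, and trigger names invented by me:

ALTER TABLE dbo.TS1 ADD last_modified DATETIME NOT NULL
    CONSTRAINT DF_TS1_last_modified DEFAULT GETDATE();
GO
CREATE INDEX IX_TS1_last_modified ON dbo.TS1 (last_modified);
GO
CREATE TRIGGER trg_TS1_last_modified ON dbo.TS1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp every legitimately changed row with the current time
    UPDATE t
    SET last_modified = GETDATE()
    FROM dbo.TS1 AS t
    INNER JOIN inserted AS i ON i.ID = t.ID;
END;
GO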
The other option is to perform a burn and load each time: disable any constraints on TS2, truncate TS2 and copy the data into TS2 from TS1, then finally re-enable the constraints and rebuild any indexes.
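A sketch of that sequence, again with assumed names (note that TRUNCATE will refuse to run if other tables reference TS2 via foreign keys, even disabled ones; in that case a DELETE, or dropping and re-creating the constraints, is needed instead):

ALTER TABLE dbo.TS2 NOCHECK CONSTRAINT ALL;            -- disable check/FK constraints
TRUNCATE TABLE dbo.TS2;

INSERT INTO dbo.TS2 WITH (TABLOCK)                     -- TABLOCK helps get minimal logging
SELECT * FROM dbo.Staging_TS1;

ALTER TABLE dbo.TS2 WITH CHECK CHECK CONSTRAINT ALL;   -- re-enable and re-validate
ALTER INDEX ALL ON dbo.TS2 REBUILD;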
Best of luck to you.
I have a database table which has more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row will be returned as a single row; sometimes it will be part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up a database table. I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a column LAST_ACCESSED and update this column whenever I select a row from this table. But this is a costly operation for me in terms of the time taken for the whole process. At least 1000-10000 records will be selected from the table in a single operation. Is there any efficient way to do this, rather than updating the table after reading from it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waiting periods for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. creating a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion, you can then either use In-Database Archiving to mark rows as archived, or just go ahead and purge the rows. If you happen to have the data partitioned, you can use simple DROP PARTITION operations to purge one or many partitions rather than having to run conventional DELETE statements.
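A rough sketch of what that looks like in practice; all names below are invented, and the Heat Map dictionary view and its columns are the 12c ones as I recall them, so verify them against your release:

ALTER SYSTEM SET HEAT_MAP = ON SCOPE=BOTH;

-- After a representative period, see when each partition was last touched
SELECT object_name, subobject_name, track_time, full_scan, lookup_scan
FROM dba_heat_map_seg_histogram
WHERE owner = 'APP_OWNER'
  AND object_name = 'BIG_TABLE'
ORDER BY track_time;

-- Partitions that show no reads for long enough can then be purged cheaply
ALTER TABLE app_owner.big_table DROP PARTITION p_2010_q1 UPDATE INDEXES;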
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and similarly the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table where entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
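For illustration, a minimal sketch of that tracking-table approach; every name here (tables, columns, the :guid bind variable, the created_date age column) is invented by me:

CREATE TABLE still_used_ids (
    row_guid  RAW(16) PRIMARY KEY,
    last_seen DATE NOT NULL
);

-- Each code path that reads an old row records the GUID it touched
MERGE INTO still_used_ids u
USING (SELECT :guid AS row_guid FROM dual) s
  ON (u.row_guid = s.row_guid)
WHEN MATCHED THEN UPDATE SET u.last_seen = SYSDATE
WHEN NOT MATCHED THEN INSERT (row_guid, last_seen) VALUES (s.row_guid, SYSDATE);

-- Cleanup: delete old rows that were never recorded as used
DELETE FROM big_table b
WHERE b.created_date < ADD_MONTHS(SYSDATE, -60)
  AND NOT EXISTS (SELECT 1 FROM still_used_ids u WHERE u.row_guid = b.row_guid);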
I have table A on server 1 and table B on server 2.
The table contains around 1.5 million rows.
What would be the fastest way to copy table A over to server 2, on a nightly basis?
Or, what would be the fastest way to bring only the records that changed in table A over to table B?
So far I have tried MERGE along with the HASHBYTES function to capture only the records that changed. It works perfectly if the target and source tables are on the same server (it takes approx. 1 min).
But if the target is on server 2 and the source is on server 1, it takes more than 15 min.
What is, in your opinion, the best and fastest technique for such operations?
Some sort of replication? Or maybe SSIS would be best for that?
My 2 cents: since you qualified your question with "on a nightly basis", I'd say do this in SSIS.
I would use SSIS; it is designed to do fast, large data copies between servers.
Also, if you can drop table B, then you could try using SELECT INTO rather than INSERT INTO.
SELECT INTO is much faster as it is minimally logged, but note that table B will be locked while the insert is running.
You could also try disabling the indexes on table B before you insert and re-enabling them afterwards.
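For example, a sketch assuming a linked server named ServerA and database/table names that are placeholders of mine:

IF OBJECT_ID('dbo.TableB', 'U') IS NOT NULL
    DROP TABLE dbo.TableB;

-- SELECT INTO pulled over the linked server via a four-part name
SELECT *
INTO dbo.TableB
FROM ServerA.MyDb.dbo.TableA;

-- Alternatively, keep TableB but disable its nonclustered indexes around the load
ALTER INDEX IX_TableB_SomeColumn ON dbo.TableB DISABLE;
-- ...run the INSERT here...
ALTER INDEX ALL ON dbo.TableB REBUILD;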
I have to get data from many tables and combine them into a single one.
The final table will have about 120 million rows.
I'm planning to insert the rows in the exact order required by the big table's indexes.
My question is, in terms of performance:
Is it better to create the indexes on the new table from the start, or to do the inserts first and create the indexes at the end of the import?
Also, would it make a difference if, when building the indexes at the end, the rows were already sorted according to the index specifications?
I can't test both cases and get an objective comparison, since the database is on the main server, which is used by many other databases and applications and may or may not be heavily loaded at any given moment. I can't restore the database to my local server either, since I don't have full access to the main server yet.
I suggest copying the data in first and then creating your indexes. If you insert records into a table that has an index, SQL Server has to update the index for each insert; if you create the index after inserting all the records, SQL Server builds the index once.
You can use SSIS to copy the data from the source tables to the destination. SSIS uses bulk insert and has good performance. Also, if you have any triggers on the destination database, I suggest disabling them before starting your conversion.
When you create an index on your table, the rows are sorted according to the index definition (for a clustered index, the table rows themselves are stored in that order), so feeding the build rows that are already sorted reduces the work of the sort step.
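A sketch of the load-first, index-last order, with table and column names invented by me:

CREATE TABLE dbo.BigTable (
    id      BIGINT       NOT NULL,
    payload VARCHAR(100) NULL
);

-- ...bulk load the ~120 million rows here (SSIS, BULK INSERT, INSERT...SELECT)...

-- Build the index once at the end; if the input arrived already sorted by id,
-- the sort phase of the build has less work to do
CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_id
    ON dbo.BigTable (id)
    WITH (SORT_IN_TEMPDB = ON);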
Let's say I have 2 servers, and one identical table per server. In each table, I have identity increment on (by 1, if you ask), and there is a 'time' column to note when the record was updated/inserted.
So it's kinda like this:
ID Content Time
1 banana 2011-01-01 09:59:23.000
2 apple 2011-01-02 12:41:01.000
3 pear 2011-04-05 04:05:44.000
I want to copy (insert/update) all the contents from one table to the other periodically, with these requirements:
a. Copy (insert/update) only records before a certain MONTH, i.e. before August 2011. This part is easy, though.
b. Insert only if the record is really new (maybe if the ID doesn't exist?).
c. Update if the 'Time' column is newer than the last performed copy (which basically means there was an update to that record; I save the date/time of the last copy too).
I could do all that by building a program and checking record by record, but with hundreds of thousands of records it would be pretty slow.
Could it be done using just queries?
Btw, I'm using this query to copy between servers, on SQL Server 2005:
INSERT OPENQUERY(TESTSERVER, 'SELECT * FROM Table1')
SELECT * FROM Table1
Thanks for the help :)
My strategy in a situation like this is:
First do an outer join and record the state of each row in a temp table,
e.g.
#temp
id state
1 update
2 copy
3 copy
4 insert
Then run n statements joining the three tables (source, target, and #temp) together, one for each of the states; a sketch follows at the end of this answer. Sometimes you can eliminate the multiple statements by inserting empty rows into the target with the correct keys; then you only need a single, more complex update/copy.
However, since these are cross-server, I'd suggest the following strategy: copy the source table over entirely to the other server, and then do the above on that server.
Only after this do I do performance analysis to optimise if needed.
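Putting it together, a sketch assuming the source has already been copied across as dbo.SourceCopy, with an ID key and the Content/Time columns from the question (all object names are placeholders, and this stays within SQL Server 2005 syntax):

-- Classify every source row
SELECT s.ID,
       CASE WHEN t.ID IS NULL        THEN 'insert'
            WHEN s.[Time] > t.[Time] THEN 'update'
            ELSE 'copy'              -- already identical
       END AS state
INTO #temp
FROM dbo.SourceCopy AS s
LEFT OUTER JOIN dbo.Target AS t ON t.ID = s.ID;

-- One statement per state
INSERT INTO dbo.Target (ID, Content, [Time])
SELECT s.ID, s.Content, s.[Time]
FROM dbo.SourceCopy AS s
INNER JOIN #temp AS x ON x.ID = s.ID AND x.state = 'insert';

UPDATE t
SET t.Content = s.Content, t.[Time] = s.[Time]
FROM dbo.Target AS t
INNER JOIN #temp AS x ON x.ID = t.ID AND x.state = 'update'
INNER JOIN dbo.SourceCopy AS s ON s.ID = t.ID;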
So for this one project, we have a bunch of queries that are executed on a regular basis (every minute or so). I used the "Analyze Query in Database Engine" option to check on them.
They are pretty simple:
select * from tablex where processed='0'
There is an index on processed, and each query should return <1000 rows on a table with 1MM records.
The Analyzer recommended creating some STATISTICS on this... So my questions are: What are those statistics? Do they really help performance? How costly are they for a table like the one above?
Please bear in mind that by no means would I call myself an experienced SQL Server user... and this is the first time I've used this Analyzer.
Statistics are what SQL Server uses to determine the viability of different ways of getting at the data.
Let's say, for instance, that you have a table that only has a clustered index on the primary key. When you execute SELECT * FROM tablename WHERE col1=value, SQL Server only has one option, to scan every row in the table to find the matching rows.
Now we add an index on col1, so you assume that SQL Server will use the index to find the matching rows, but that's not always true. Let's say that the table has 200,000 rows and col1 has only 2 values: 1 and 0. When SQL Server uses an index to find data, the index contains pointers back to the clustered index position. Given that there are only two values in the indexed column, SQL Server decides it makes more sense to just scan the table, because using the index would be more work.
Now we'll add another 800,000 rows of data to the table, but this time the values in col1 are widely varied. Now it's a useful index because SQL Server can viably use the index to limit what it needs to pull out of the table. Will SQL Server use the index?
It depends. And what it depends on are the Statistics. At some point in time, with AUTO UPDATE STATISTICS set on, the server will update the statistics for the index and know it's a very good and valid index to use. Until that point, however, it will ignore the index as being irrelevant.
That's one use of statistics. But there is another use and that isn't related to indices. SQL Server keeps basic statistics about all of the columns in a table. If there's enough different data to make it worthwhile, SQL Server will actually create a temporary index on a column and use that to filter. While this takes more time than using an existing index, it takes less time than a full table scan.
Sometimes you will get recommendations to create specific statistics on columns where that would be useful. These aren't indices, but they do keep track of a statistical sampling of the data in the column, so SQL Server can determine whether it makes sense to create a temporary index to return the data.
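For example, a hand-created column statistic of the kind the advisor suggests might look like this (the statistic name is made up, and the table is the one from the question), plus a way to peek at what the optimizer sees:

CREATE STATISTICS ST_tablex_processed
    ON dbo.tablex (processed)
    WITH FULLSCAN;

-- Show the sampled distribution the optimizer works from
DBCC SHOW_STATISTICS ('dbo.tablex', ST_tablex_processed);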
HTH
In SQL Server 2005, turn on auto create statistics and auto update statistics. You won't have to worry about creating or maintaining them yourself, since the database handles this very well on its own.
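Those are database-level options; assuming a database named MyDb, as a sketch:

ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS ON;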