Efficient way to avoid inserting duplicate records - database

Scenario of Records Insertion from one DB to another:
Suppose I have 100,000 records in database db1 and I am copying them into another database db2 in batches of 1,000 records. The next time I run the same query to select another batch of 1,000 records, I want to start from record #1001, since records #1 through #1000 have already been inserted into db2.
So, basically, on the second run I want to avoid copying the same records again, in order to avoid duplicate records in database db2.
The approach I am following right now to avoid inserting duplicates is this: db1 has an integer column named flag that starts out with a negative value, say -1, and as soon as I grab 1,000 records I update their flag to a positive value, +1. That way, the second time I run the query to grab the next set of 1,000 records, it starts from record #1001.
I am wondering whether this is an efficient approach, because every time I run the query it has to check all the records in db1 from the start and only begins grabbing data where it finds the first negative value in the flag column. Please suggest a more efficient approach if there is one.
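For reference, a minimal sketch of that flag approach, assuming SQL Server-style syntax and a hypothetical table db1.dbo.records(id, payload, flag):

-- 1) Grab the next batch of 1,000 rows that have not been copied yet.
SELECT TOP (1000) id, payload
FROM db1.dbo.records
WHERE flag = -1
ORDER BY id;

-- 2) After inserting that batch into db2, flip the flag so the next run
--    skips these rows (same deterministic ORDER BY, so the same 1,000 rows).
UPDATE r
SET    r.flag = 1
FROM   db1.dbo.records AS r
WHERE  r.id IN (SELECT TOP (1000) id
                FROM db1.dbo.records
                WHERE flag = -1
                ORDER BY id);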
P.S.: I am using JDBC to grab the records and insert them into the other database.
Thanks

I would not update the table in db1 just for this purpose.
I'd rather create another table, either in db1 or, maybe better, in db2, for control purposes.
This way you will not need to cope with write-access rights on the original table in db1, or with race conditions and multi-user conflicts.
The second table would record, for each extraction, the timestamp, the tables involved, the range extracted, the user name, the process name... and any other control information you might be interested in.
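A minimal sketch of what that control table and the batch query driven by it could look like, assuming SQL Server syntax, an ever-increasing key column id on the source table, and hypothetical names throughout (with JDBC the two halves would simply run over their respective connections):

CREATE TABLE db2.dbo.copy_control (
    batch_id       INT IDENTITY(1,1) PRIMARY KEY,
    source_table   VARCHAR(128) NOT NULL,
    last_copied_id BIGINT       NOT NULL,   -- highest source id in the batch
    rows_copied    INT          NOT NULL,
    copied_at      DATETIME2    NOT NULL DEFAULT SYSDATETIME(),
    copied_by      VARCHAR(64)  NOT NULL DEFAULT SUSER_SNAME()
);

-- Next batch: resume right after the highest key already recorded.
DECLARE @last BIGINT = (SELECT COALESCE(MAX(last_copied_id), 0)
                        FROM db2.dbo.copy_control
                        WHERE source_table = 'records');

SELECT TOP (1000) id, payload
FROM   db1.dbo.records
WHERE  id > @last
ORDER  BY id;

-- After the batch has been safely inserted into db2, add one row to
-- copy_control holding MAX(id) of that batch; the next run starts from there.

Because the batch query seeks on id > @last, it never has to walk past the rows that were already copied, which is exactly the cost the flag approach pays on every run.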

Related

Find out the recently selected rows from an Oracle table and can I update a LAST_ACCESSED column whenever the table is accessed

I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select queries come from multiple places: sometimes a row is returned on its own, sometimes as part of a larger result set, sometimes through a JDBC connection from Java code, and sometimes through a SQL procedure.
My intention is to clean up the table: I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table, but that is costly in terms of the time taken for the whole process: at least 1,000 to 10,000 records are selected from the table in a single operation. Is there any more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization, which brings Heat Maps that track table access (modifications as well as read operations). Be careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps record whenever a database block has been modified and whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently, and keeping a counter for each row would quickly become very costly). However, if your data is partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still being read and which can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
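As a sketch, assuming the feature is licensed and enabled, the 12c dictionary view DBA_HEAT_MAP_SEGMENT can be queried for the last read/write times per segment; verify the exact view and column names against your release:

-- Enable Heat Map tracking (instance-wide, dynamic parameter).
ALTER SYSTEM SET heat_map = ON;

-- When was each table/partition of MYSCHEMA last read or written?
SELECT object_name,
       subobject_name,        -- partition name, if any
       segment_read_time,     -- last read of the segment
       segment_write_time,    -- last modification of the segment
       full_scan,
       lookup_scan
FROM   dba_heat_map_segment
WHERE  owner = 'MYSCHEMA'
ORDER  BY segment_read_time;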
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger likewise hurt performance.
Finally I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
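A rough sketch of that arrangement with hypothetical names, assuming the GUID is stored as RAW(16) and that some age criterion (here a hypothetical created_date column) identifies the deletion candidates:

-- Keys the application still touches get recorded here.
CREATE TABLE still_used_keys (
    row_guid  RAW(16) NOT NULL PRIMARY KEY,
    last_seen DATE    DEFAULT SYSDATE NOT NULL
);

-- Cleanup: delete old candidate rows only if they are not recorded as still used.
DELETE FROM big_table t
WHERE t.created_date < ADD_MONTHS(SYSDATE, -60)   -- hypothetical age criterion
  AND NOT EXISTS (SELECT 1
                  FROM still_used_keys k
                  WHERE k.row_guid = t.row_guid);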

Fastest way to compare multiple column values in sql server?

I have a SQL Server table of 200 million records that exists on two different servers, and I need to move this table from Server 1 to Server 2.
The table on Server 1 can be a subset or a superset of the table on Server 2. Some of the records (around 1 million) on Server 1 have been updated, and I need to apply those updates on Server 2. So currently I am following this approach:
1) Use SSIS to move the data from Server 1 to a staging database on Server 2.
2) Compare the staging data with the table on Server 2 column by column; if any column is different, update the whole row.
This is taking a lot of time. I tried using HASHBYTES in order to compare rows like this:
HASHBYTES('sha',CONCAT(a.[account_no],a.[transaction_id], ...))
<>
HASHBYTES('sha',CONCAT(b.[account_no],b.[transaction_id], ...))
But this is taking even more time.
Any other approach which can be faster and can save time?
This is a problem that's pretty common.
First, do not try to do the updates directly in SQL: the performance will be terrible, and it will bring the database server to its knees.
For context, TS1 will be the table on Server 1 and TS2 will be the table on Server 2.
Using SSIS - create two steps within the job:
First, find the deleted rows: scan TS2 by ID and delete any TS2 row whose ID does not exist in TS1.
Second, scan TS1; if the ID exists in TS2, you will need to update that record. If memory serves, SSIS can inspect for differences and update only when needed; otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, then insert the record.
I can't speak to performance on this due to variations in schemas and servers, but it will be compute-intensive to analyze the 200 million records. It WILL take a long time.
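A rough T-SQL sketch of those passes, assuming TS1 is reachable from Server 2 (e.g. as the staged copy), that the tables are keyed by a hypothetical id column, and showing only the two column names from the question; SSIS plumbing and batching are omitted:

-- Pass 1: delete rows that no longer exist in the source.
DELETE t2
FROM   TS2 AS t2
WHERE  NOT EXISTS (SELECT 1 FROM TS1 AS t1 WHERE t1.id = t2.id);

-- Pass 2: update rows whose content differs (EXCEPT handles NULLs cleanly).
UPDATE t2
SET    t2.account_no     = t1.account_no,
       t2.transaction_id = t1.transaction_id      -- ...remaining columns
FROM   TS2 AS t2
JOIN   TS1 AS t1 ON t1.id = t2.id
WHERE  EXISTS (SELECT t1.account_no, t1.transaction_id
               EXCEPT
               SELECT t2.account_no, t2.transaction_id);

-- Pass 3: insert rows that are new in the source.
INSERT INTO TS2 (id, account_no, transaction_id)   -- ...remaining columns
SELECT t1.id, t1.account_no, t1.transaction_id
FROM   TS1 AS t1
WHERE  NOT EXISTS (SELECT 1 FROM TS2 AS t2 WHERE t2.id = t1.id);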
For ongoing execution, you will need to add a "last modified date" timestamp to each record and a trigger that updates the field on any legitimate change, then use that to narrow your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last-modified-date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index on that column to aid the filtering.
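A hedged sketch of that last-modified-date plumbing, again with hypothetical names and assuming an id key column (inserted rows pick up the default; updates are stamped by the trigger):

ALTER TABLE TS1 ADD last_modified DATETIME2 NOT NULL
    CONSTRAINT DF_TS1_last_modified DEFAULT SYSUTCDATETIME();
GO
CREATE TRIGGER trg_TS1_last_modified ON TS1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET    t.last_modified = SYSUTCDATETIME()
    FROM   TS1 AS t
    JOIN   inserted AS i ON i.id = t.id;
END;
GO
CREATE INDEX IX_TS1_last_modified ON TS1 (last_modified);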
The other option is to perform a burn-and-load each time: disable any constraints around TS2, truncate TS2 and copy the data into TS2 from TS1, and finally re-enable the constraints and rebuild any indexes.
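A rough outline of that burn-and-load in T-SQL, with the same hypothetical names as above (note that TRUNCATE is blocked if other tables reference TS2 with foreign keys, even disabled ones; those would have to be dropped and recreated, or the truncate replaced with a DELETE):

ALTER TABLE TS2 NOCHECK CONSTRAINT ALL;            -- disable TS2's own FK/CHECK constraints

TRUNCATE TABLE TS2;

INSERT INTO TS2 WITH (TABLOCK) (id, account_no, transaction_id)   -- ...remaining columns
SELECT id, account_no, transaction_id
FROM   TS1;

ALTER TABLE TS2 WITH CHECK CHECK CONSTRAINT ALL;   -- re-enable and re-validate
ALTER INDEX ALL ON TS2 REBUILD;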
Best of luck to you.

Is Log Sequence Number (LSN) unique for database or table in SQL Server?

I am using SQL Server CDC to track changes to multiple tables, and I have a program that collects the data from each CDC table. I want to report these changes in the right sequence for each table, and I want to make sure that all the changes happening to these tables are reported in the correct order. Can I rely on the LSN for the right sequence?
The LSN number is unique for a given transaction but is not globally unique. If you have multiple records within the same transaction they will all share the same __$start_lsn value in cdc. If you want the correct order of operations you need to sort by __$start_lsn, __$seqval, then __$operation. The __$seqval represents the id of the individual operation within the wrapping transaction.
For example, I have a table in the dbo schema named foo. It has one column y. If I run this statement:
INSERT INTO dbo.foo VALUES (1);
INSERT INTO dbo.foo VALUES (2);
Then I will see two separate LSN values in cdc because these are in two separate transactions. If I run this:
BEGIN TRAN
INSERT INTO dbo.foo VALUES (1);
INSERT INTO dbo.foo VALUES (2);
COMMIT TRAN
Then I will see one LSN value for both records, but they will have different __$seqval values, and the seqval for my first record will be less than the seqval for my second record.
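For completeness, a sketch of reading the changes in that order, assuming the default capture instance name dbo_foo for the dbo.foo table above:

DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_foo');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT __$start_lsn, __$seqval, __$operation, y
FROM   cdc.fn_cdc_get_all_changes_dbo_foo(@from_lsn, @to_lsn, 'all')
ORDER  BY __$start_lsn, __$seqval, __$operation;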
The LSN is unique and ever-increasing within the database, across all tables in that database.
In most cases the LSN value is unique across all tables; however, I found instances where one single LSN value belonged to changes in 40 tables. I don't know the SQL script associated with those changes, but I know that all the operations were INSERTs.
I am not sure whether it is a bug. The CDC documentation is poor and covers just the basics. Not many users know that the CDC capture process has many bugs confirmed by MS for both SQL 2014 and 2016 (we have an open case).
So I would not rely on the documentation; it may be wrong in some scenarios. It's better to implement extra checks and test with a large volume of different combinations of changes.
I also encountered that scenario. In my experience, in your first example two transactions happened, so you really do get 2 different LSNs, while in your second example you have only 1 transaction with 2 statements inside; CDC counts it as a single transaction since it is inside the BEGIN TRAN ... COMMIT TRAN block. I can't provide links since this is from my personal experience.

Efficient DELETE TOP?

Is it more efficient, and ultimately FASTER, to delete rows from a DB in blocks of 1,000 or 10,000? I have to remove approximately 3 million rows from many tables. I first did the deletes in blocks of 100K rows, but the performance wasn't looking good. I changed to 10,000 and the rows seem to be removed faster. I am wondering whether going even smaller, like 1K per DELETE statement, would be better still.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server, though. Last time I did that, I was using this approach to delete rows in increments of 64 million (on a table that at that point had around 14 billion rows, 80% of which ultimately got deleted), and I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on the transaction log for other things operating on the table. You have to experiment and find where you are comfortable; there is no ultimate answer because it depends entirely on the usage of the table and on the hardware.
We used table partitioning to remove 5 million rows in less than a second, but that was from just one table. It took some work up front, but it ultimately was the best way. That may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
The new data CANNOT overlap the current data. You have to partition the data on a value, and the new data cannot be intertwined with the currently partitioned data. If you are removing data, you have to remove an entire partition (or partitions); you will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
1) Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
2) Create a new table with the exact same structure (keys, data types, etc.).
3) Add a constraint to the new table to limit its data so that it fits into the blank partition of the old table.
4) Insert the new rows into the new table.
5) Add indexes to match the old table.
6) Swap the new table with the blank partition of the old table.
7) Un-partition the old table if you wish.
FOR DELETING RECORDS
1) Partition the table into sets so that the data you want to delete sits in partitions of its own (this could be many different partitions).
2) Create a new table with the same partitions.
3) Swap the partitions holding the data you want to delete into the new table (see the sketch below).
4) Un-partition the old table if you wish.
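A bare-bones T-SQL illustration of the swap in step 3, with hypothetical table names; setting up the matching partition function and scheme is omitted:

-- Move partition 2 (the doomed rows) out of the live table into an
-- identically structured, identically partitioned staging table.
ALTER TABLE dbo.BigTable
    SWITCH PARTITION 2 TO dbo.BigTable_ToDelete PARTITION 2;

DROP TABLE dbo.BigTable_ToDelete;   -- the "delete" is now a metadata-only operation

The switch and the drop are metadata operations, which is why the partition route can remove millions of rows in under a second.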
Yes and no; it depends on how the table is used, because of locking. I would actually try to delete the records at a slower pace, so the opposite of what the OP is asking, for example:
set rowcount 10000                        -- cap each delete at 10000 rows
while 1 = 1
begin
    waitfor delay '0:0:1'                 -- pause a second between batches
    delete
    from TABLE
    where Date < convert(datetime, '20120101', 112)
    if @@rowcount = 0 break               -- stop once nothing is left to delete
end
set rowcount 0                            -- reset the session-level limit

Incremental reports with JasperReports

I am using JasperReports to generate reports from SQL Server on a daily basis. The problem is that every day the report reads the data from the beginning, but I want it to exclude records read earlier and include only new rows. The database is old and the table doesn't have a timestamp column, so there is no way to identify which records are 'new' and which are 'old'.
I am not allowed to modify it either.
Please suggest any other way if possible.
You can create a new table and, every time you print records on your report, insert those records into it. The report query can then use a NOT EXISTS condition against the new table to filter the original table down to rows that have not been reported yet, as sketched below.
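A sketch of that idea in T-SQL, with hypothetical table and column names (only the keys of reported rows are stored):

CREATE TABLE printed_rows (
    source_id  BIGINT   NOT NULL PRIMARY KEY,
    printed_at DATETIME NOT NULL DEFAULT GETDATE()
);

-- Report query: only rows that have not been reported before.
SELECT s.*
FROM   source_table AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   printed_rows p
                   WHERE  p.source_id = s.id);

-- After the report has run, record what was just printed.
INSERT INTO printed_rows (source_id)
SELECT s.id
FROM   source_table AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   printed_rows p
                   WHERE  p.source_id = s.id);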
The obvious drawbacks of this approach are the space consumed in the DB and the extra work of inserting records into the new table, but if you cannot modify the original table, it's the only solution.
Otherwise, the Alex K suggestion is very good.
