In my current project, some tables have a column named "changed", which indicates whether the current row has been changed since the last check. All the insert and update statements include this column.
Every hour, I run a scheduled task that queries all changed rows, does some work with them, and then sets their "changed" column back to NULL.
This is potentially a performance issue: since I'm going to do lots of read and write operations on this column, its index will constantly be rebuilt.
What's the best option for this scenario (other than not using this kind of mechanism at all)?
If your table is huge, drop the column and make a dedicated table (with primary key info only) and have triggers insert into this table. You then only need to process this small table, and clear it as you finish rows. You would need to do this for each table you are tracking.
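A minimal sketch of such a trigger, assuming SQL Server syntax and hypothetical names (Orders / OrderID); adapt it to your schema and DBMS:

CREATE TABLE OrderChanges (OrderID int NOT NULL PRIMARY KEY);

CREATE TRIGGER trg_Orders_Changed
ON Orders
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Record the keys of changed rows; skip keys that are already queued.
    INSERT INTO OrderChanges (OrderID)
    SELECT i.OrderID
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM OrderChanges c WHERE c.OrderID = i.OrderID);
END;

The hourly job then reads OrderChanges, processes the referenced rows, and deletes the keys it has finished.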
If your tables are small, the column may not be a bad idea, but you may see blocking/locking if there are lots of selects and updates on these tables, and if your scheduled processing loops or is very slow.
If you go with a column, it might be better to have a LastChgDate column; then you just process all rows within a range (you'll need to track the range to process each time), and you won't need to change LastChgDate to mark a row as "done". This might be moot if your scheduled process updates the actual row, but you don't say.
Since the column likely only has two values (null and 1 for changed), an index is probably useless anyway.
I am testing different strategies for an upcoming breaking change. The problem is that each experiment carries some cost in Azure.
The data is huge and can have some inconsistencies, due to many years of fixes and transactions before I even knew the company.
I need to change a column in a table with millions of records and dozens of indexes. This will cause significant downtime.
ALTER TABLE X ALTER COLUMN A1 decimal(15, 4) --The original column is int
One of the initial ideas (now I know this is not possible) was to have a secondary replica, make the changes there, and, when the changes finish, swap primary with secondary: zero or almost zero downtime. I am referring to a "live", redundant replica, not just a "copy".
EDIT:
Throwing out new ideas:
A variation on what has been mentioned in one of the answers: create a replica of the table (not the whole DB, just the table), populate it with INSERT INTO ... SELECT, and swap the tables at the end of the process. Or do the swap early to minimize downtime, in exchange for a delay while the remaining records from the source are added afterwards.
I have tried this, but it takes AGES to complete. Also, some NULL and FK violations make the process fail after it has been running for several hours.
"Resuming" could be an option, but it makes the process slower with each execution. Without some kind of "resume", each failure has to be repeated from scratch.
An acceptable improvement could be to IGNORE the errors (but create logs, of course) and apply fixes after the migration. But AFAIK, neither Azure SQL nor SQL Server offers an "ignore" option.
Drop all indexes, constraints and dependencies on the column that needs to be modified, modify the column, and apply all the indexes, constraints and dependencies again.
Also tried this one. Some indexes take AGES to rebuild, but for now it seems to be the best bet.
There is a possible variation of applying ROW COMPRESSION before the datatype change, but I think it will not help with the real cost: index re-creation.
Create a new column with the target datatype, copy the data from the source column, drop the old column and rename the new one.
This strategy also requires dropping and regenerating the indexes, so it will not offer much gain (if any) over #2.
A friend thought of a variation on this: build the needed indexes on the column copy ONLINE. In the meantime, use a trigger to propagate all changes on the source column to the copy.
For any of the mentioned strategies, some gain can be obtained by increasing the processing power. In any case we would scale up regardless of the approach, so this factor is common to all solutions.
When you need to update A LOT of rows as a one-time event, maybe it's more effective to use the following migration technique (a sketch follows the list):
create a new target table
use INSERT INTO SELECT to fill the new table with correct / updated values
rename the old and new tables
create indexes for the new table
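A minimal sketch of those steps in T-SQL, with hypothetical table and column names (dbo.X with columns Id, A1, B1); the real column list, constraints and indexes depend on your schema:

-- New table with the target datatype already in place.
CREATE TABLE dbo.X_new
(
    Id int            NOT NULL,
    A1 decimal(15, 4) NOT NULL,
    B1 nvarchar(100)  NULL
);

-- Fill it with the converted values.
INSERT INTO dbo.X_new (Id, A1, B1)
SELECT Id, CAST(A1 AS decimal(15, 4)), B1
FROM dbo.X;

-- Swap the tables (only this part needs downtime).
EXEC sp_rename 'dbo.X', 'X_old';
EXEC sp_rename 'dbo.X_new', 'X';

-- Recreate indexes, primary/foreign keys, etc. on the new table.
CREATE UNIQUE CLUSTERED INDEX PK_X ON dbo.X (Id);
CREATE INDEX IX_X_A1 ON dbo.X (A1);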
After many tests and backups, we finally used the following approach (a sketch of the key statements is at the end of this answer):
Create a new column [columnName_NEW] with the desired new format. Allow NULLs.
Create a trigger on INSERT to populate the new column with the value from the column being replaced.
Copy the old column's value to the new column in batches.
This operation is very time-consuming. We ran a batch every day in a maintenance window (2 h a day over 4 days). Our batches filled the values taking the oldest rows first; we counted on the trigger to fill the new ones.
Once #3 is complete, don't allow NULLs anymore on the new column, and set a default value so the INSERT trigger doesn't crash.
Create all the needed indexes and views on the new column. This is very time consuming but can be done ONLINE
Allow NULLS on the old column
Remove the insert trigger - start downtime now!
Rename the old column to [columnName_OLD] and the new one to [columnName]. This requires only a few seconds of downtime!
--> You can consider it finally done!
After some safe time, you can back up the result and remove [columnName_OLD] with all of its dependencies.
I accepted the other answer because I think it will also be useful in most situations. This one has more steps, but it has very little downtime and is reversible at every step but the last.
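A minimal sketch of the key statements above in T-SQL, with hypothetical names (dbo.X, key column Id, column A1 going from int to decimal); this is not the exact script we ran:

ALTER TABLE dbo.X ADD A1_NEW decimal(15, 4) NULL;            -- new column, NULLs allowed
GO
CREATE TRIGGER trg_X_FillA1New ON dbo.X AFTER INSERT         -- keep newly inserted rows in sync
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE x
    SET    A1_NEW = CAST(i.A1 AS decimal(15, 4))
    FROM   dbo.X x
    JOIN   inserted i ON i.Id = x.Id;
END;
GO
-- Batch copy during maintenance windows; repeat until 0 rows are affected.
UPDATE TOP (100000) dbo.X
SET    A1_NEW = CAST(A1 AS decimal(15, 4))
WHERE  A1_NEW IS NULL;
GO
-- The swap (short downtime): drop the trigger, then rename the columns.
DROP TRIGGER trg_X_FillA1New;
EXEC sp_rename 'dbo.X.A1',     'A1_OLD', 'COLUMN';
EXEC sp_rename 'dbo.X.A1_NEW', 'A1',     'COLUMN';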
I have a database table which has more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select query can happen from multiple places: sometimes the row is returned on its own, sometimes as part of a set of rows. A select query does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the table: I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But this is a costly operation for me in terms of the time taken for the whole process; at least 1,000-10,000 records will be selected from the table in a single operation. Is there any efficient way to do this rather than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Be careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
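A rough sketch of how that could look (Oracle 12c+; the exact view columns may vary slightly by release, so treat this as an assumption to verify):

-- Enable Heat Map tracking (requires the licensing mentioned above).
ALTER SYSTEM SET heat_map = ON;

-- Later, check when each segment (table or partition) was last read or written.
SELECT object_name, subobject_name, segment_write_time, segment_read_time, full_scan, lookup_scan
FROM   dba_heat_map_segment
WHERE  owner = 'MYSCHEMA';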
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded. Auditing uses up a lot of space and has a performance hit; similarly, the trigger also had a performance hit.
Finally I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting the entries present in it.
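A rough sketch of that idea in Oracle SQL, with hypothetical names (BIG_TABLE with columns GUID and CREATED_DATE, and a USED_KEYS tracking table); the actual schema will differ:

-- The application records the key of every old row it still touches.
INSERT INTO used_keys (row_guid, last_used) VALUES (:guid, SYSDATE);

-- Clean-up: delete rows older than 5 years that never appear in USED_KEYS.
DELETE FROM big_table b
WHERE  b.created_date < ADD_MONTHS(SYSDATE, -60)
AND    NOT EXISTS (SELECT 1 FROM used_keys u WHERE u.row_guid = b.guid);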
Description
I use Postgres together with Python 3.
There are 17 million rows in the table, and the max ID is 3,000 million+.
My task is to run select id, link from table where data is null;, do some processing in code, and then run update table set data = %s where id = %s for each row.
I tested that a single-row update takes 0.1 s.
My thoughts
Here are my ideas:
Try a new database; I have heard about Redis, but I don't know how to do it.
In addition, what is the best number of connections?
I used to open 5-6 connections.
Now I use only two connections, which works better: one hour updates 2 million rows.
If there is any way you can push the calculation of the new value into the database, i.e. issue a single large UPDATE statement like
UPDATE "table"
SET data = [calculation here]
WHERE data IS NULL;
you would be much faster.
But for the rest of this discussion I'll assume that you have to calculate the new values in your code, i.e. run one SELECT to get all the rows where data IS NULL and then issue a lot of UPDATE statements, each targeting a single row.
In that case, there are two ways you can speed up processing considerably:
Avoid index updates
Updating an index is more expensive than adding a tuple to the table itself (the appropriately so-called heap, onto which it is quick and easy to pile up entries). So by avoiding index updates, you will be much faster.
There are two ways to avoid index updates:
Drop all indexes after selecting the rows to change and before the UPDATEs and recreate them after processing is completed.
This will be a net win if you update enough rows.
Make sure that there is no index on data and that the table has been created with a fillfactor of less than 50. Then there is enough room in the data pages to write the update into the same page as the original row version, which obviates the need to update the index (this is known as a HOT update).
This is probably not an option for you, since you probably didn't create the table with a fillfactor like that, but I wanted to add it for completeness' sake.
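A short sketch of both variants in PostgreSQL, with hypothetical names (mytable, data, idx_mytable_data):

-- Variant 1: drop the index before the mass update and recreate it afterwards.
DROP INDEX IF EXISTS idx_mytable_data;
-- ... run all the UPDATE statements here ...
CREATE INDEX idx_mytable_data ON mytable (data);

-- Variant 2: a low fillfactor leaves free space in each page for HOT updates,
-- but it only helps if the table was created (or rewritten) this way up front.
CREATE TABLE mytable (
    id   bigint PRIMARY KEY,
    link text,
    data text
) WITH (fillfactor = 45);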
Bundle many updates in a single transaction
By default, each UPDATE will run in its own transaction, which is committed at the end of the statement. However, each COMMIT forces the transaction log (WAL) to be written out to disk, which slows down processing considerably.
You do that by explicitly issuing a BEGIN before the first UPDATE and a COMMIT after the last one. That will also make the whole operation atomic, so that all changes are undone automatically if processing is interrupted.
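For example (PostgreSQL; the values are placeholders):

BEGIN;
UPDATE mytable SET data = 'value 1' WHERE id = 1;
UPDATE mytable SET data = 'value 2' WHERE id = 2;
-- ... many more single-row updates ...
COMMIT;  -- the WAL is flushed once here instead of once per statement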
I want to place DB2 Triggers for Insert, Update and Delete on DB2 Tables heavily used in parallel online Transactions. The tables are shared by several members on a Sysplex, DB2 Version 10.
In each of the DB2 Triggers I want to insert a row into a central table and have one background process calling a Stored Procedure to read this table every second to process the newly inserted rows, ordered by sequence of the insert (sequence number or timestamp).
I'm very concerned about DB2 Index locking contention and want to make sure that I do not introduce Deadlocks/Timeouts to the applications with these Triggers.
Obviously I would take advantage of DB2 features to reduce locking, like row-level locking, but I still see no really good approach to avoiding index contention.
I see three different options to select the newly inserted rows.
Put a sequence number in the table and store the last processed sequence number in the background process. I would use the following SELECT statement:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE SEQ_NO > 'last-seq-number'
ORDER BY SEQ_NO;
The locking level must be CS to avoid selecting uncommitted rows, which might later be rolled back.
I think I need one Index on the table with SEQ_NO ASC
Pro: Background process only reads rows and makes no updates/deletes (only shared locks)
Con: Index contention because an ascending key is used.
I can clean up processed records later (e.g. by rolling partitions).
Put a Status field in the table (processed and unprocessed) and change the Select as follows:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE STATUS = 'unprocessed'
ORDER BY TIMESTAMP;
Later I would update the STATUS on the selected rows to "processed"
I think I need an Index on STATUS
Pro: No ascending sequence number in the index and no direct deletes
Cons: Concurrent updates by online transactions and the background process
Clean-up would happen in off-hours
DELETE the processed records instead of updating the status field.
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
ORDER BY TIMESTAMP;
Since the table contains very few records, no index is required which could create a hot spot.
Also I think I could SELECT with Isolation Level UR, because I would detect potential uncommitted data on the later delete of this row.
For a primary key index I could use GENERATE_UNIQUE, which is random and not ascending.
Pro: No Index hot spot and the Inserts can be spread across the tablespace by random UNIQUE_ID
Con: Tablespace scan and sort on every call of the Stored Procedure and deleting records in parallel to the online inserts.
Looking forward to what the community thinks about this problem. This must be a pretty common problem; e.g., SAP should have a similar issue with their Batch Input tables.
I tend to favour Option 3, because it avoids index contention.
Maybe there is still another solution out there in your minds.
I think you are going to have numerous performance problems with your various solutions.
(I know premature optimization is a sin, but experience tells us that some things are just not going to work in a busy system.)
You should be able to use DB2's identity (auto-increment) feature to get your sequence number, with little or no performance impact.
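For example, a central table sketched roughly like this (DB2; the table name comes from the question, the non-key columns are hypothetical):

CREATE TABLE CENTRAL_TABLE
(
    SEQ_NO    BIGINT     NOT NULL GENERATED ALWAYS AS IDENTITY,
    SRC_KEY   CHAR(20)   NOT NULL,
    OPERATION CHAR(1)    NOT NULL,            -- 'I', 'U' or 'D'
    CHG_TS    TIMESTAMP  NOT NULL WITH DEFAULT
);
-- Plus a unique index on SEQ_NO for the background process to read by.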
For the rest, perhaps you should look at a queue-based solution.
Have your trigger drop the operation (INSERT/UPDATE/DELETE) and the keys of the row onto an MQ queue.
Then have a long-running background task (in CICS?) do your post-processing. As it processes one update at a time, you should not trip over yourself. Having a single loaded and active task with the ability to batch up units of work should give you a throughput on the order of 300 to 500 updates a second.
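A very rough sketch of such a trigger, assuming the DB2 MQ functions (DB2MQ.MQSEND) are installed and configured; the exact function schema and signature vary by DB2 release, so verify against your system before relying on this:

CREATE TRIGGER TRG_ORDERS_UPD
AFTER UPDATE ON ORDERS                       -- ORDERS and ORDER_ID are hypothetical names
REFERENCING NEW AS N
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
    -- Send the operation code and the row key as the message payload.
    VALUES DB2MQ.MQSEND('U,' || CHAR(N.ORDER_ID));
END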
I have a table that contains more than a million records (products).
Now, daily, I need to either update existing records and/or add new ones.
Instead of doing it one by one (which takes a couple of hours), I managed to use SqlBulkCopy to work with batches of records and do my inserts in a matter of seconds, but it can only handle new inserts. So I am thinking about creating a new table that contains both the new and the old records, and then using that temporary table (on the SQL end) to update/add to the main table.
Any advice on how I can perform that update?
One of the better ways to handle this is with the MERGE command in SQL. Mssqltips has a good tutorial on it; it can be a bit trickier to use than some of the other commands.
Also, due to locking you may want to break this up into multiple smaller transactions, unless you know you can tolerate blocking during the update.
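A minimal sketch, with hypothetical names (dbo.Products keyed by ProductID, and a staging table dbo.ProductsStaging loaded via SqlBulkCopy):

MERGE dbo.Products AS target
USING dbo.ProductsStaging AS source
      ON target.ProductID = source.ProductID
WHEN MATCHED THEN
    UPDATE SET target.Name  = source.Name,
               target.Price = source.Price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductID, Name, Price)
    VALUES (source.ProductID, source.Name, source.Price);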
We handle this situation in our code in the way you described; we have a temp table, then run an update where the ID in the temp table matches the table to be updated, then run an insert where the ID in the table to be updated is null. We normally do this for updates to library/program settings, though, so it is only run infrequently, on smaller tables. Performance may not be up to par for that many records, or daily runs.
The main "gotcha" I've encountered with this method is that for the update, we did a comparison to make sure at least one of several fields changed before actually running the update. (Our initial reason for this was to avoid overwriting some defaults, which could affect server behavior. Your reason for this might be performance, if your temp table could contain records that haven't actually changed). We encountered a case where we did actually want to update one of the defaults, but our old script didn't catch that. So if you do any comparisons to determine which products you want to update, make sure it is either complete from the start, or document well any fields you don't compare, and why.