I have to make C application with OCI which retrieve new rows from database, I mean: rows added in time from last session to current. ora_rowscn is not solution: this value is changed for blocks so that few different rows can have same SCN.
On example I have table with dates:
03.05.2015
05.05.2015
07.05.2015
I can make structure:
struct Bounds {
Timestamp start, end;
};
03.05.2015 is as start and 07.05.2015 is as end.
Checking rows after Bounds.end is simple. But it could be some delay or transaction after my last query and I can have new values.
03.05.2015
04.05.2015
05.05.2015
06.05.2015
07.05.2015
These new rows count can be detected by query (STARD and END are values of structure):
select count(*) from logs where log_time > START and log_time < END
Then I have 3 rows and 5 after it. My application have only read persmission.
Oracle database is concurrent environment. So generally there is no way how to tell what is the "last" inserted row because there technically is no last inserted row.
AFAIK you have two options
Use Continuous Query Notification. This bypasses SQL query interface and uses special API dedicated for this particular purpose.
The other option is to query current databases SCN and start a transaction with this SCN. See OCIStmtExecute, this function has two parameters snap_in/snap_out. Theoretically you can use them to track you view on databases SCN. But I'm not sure I never used that.
In Oracle readers do not block writers and vice-versa.
So a row inserted on 06.05.2015 (but commited on 08.05.2015) will be visible AFTER 7.5.2015. Oracle is parallel database and it does not guarantee any serialization.
Maybe if you used row level ora_rowsncn, then it would work. But this requires redefinition of the source table.
Related
If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by including rows as defined by a rowversion column to be within a rowversion "delta window", e.g.:
where row_version >= #previous_etl_cycle_rowversion
and row_version < #current_etl_cycle_rowversion
.. and the values for #previous_etl_cycle_rowversion and #current_etl_cycle_rowversion are selected from a logging table whose newest rowversion gets appended to said logging table at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select ##DBTS
... is it possible that a rowversion of a record falling within a given "delta window" (bounded by the 2 ##DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? - i.e., is it possible that rowversion would be reflected on a basis of "eventual" consistency?
I'm thinking of a case where say, 1000 records are updated within a single transaction and somehow ##DBTS is "ahead" of the record's committed rowversion yet that specific version of the record is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row versioning for the queries that read the change windows you shouldn't miss many rows. With READ COMMITTED SNAPSHOT or SNAPSHOT ISOLATION an updated but uncommitted row would not appear in your query.
But you can also miss rows that got updated after you query ##dbts. That's not such a big deal usually as they'll be in the next window. But if you have a row that is constantly updated you may miss it for a long time.
But why use rowversion? If these are temporal tables you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and optionally column changes. The feature was literally built for to replace the need to do this manually which:
usually involved a lot of work and frequently involved using a
combination of triggers, timestamp columns, new tables to store
tracking information, and custom cleanup processes
.
Under SNAPSHOT isolation, it turns out the proper function to inspect rowversion which will ensure contiguous delta windows while not skipping rowversion values attached to long-running transactions is MIN_ACTIVE_ROWVERSION() rather than ##DBTS.
I have a database table which have more than 1 million records uniquely identified by a GUID column. I want to find out which of these record or rows was selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row will be returned as a single row. Sometimes it will be part of a set of rows. there is select query that does the fetching from a jdbc connection from a java code. Also a SQL procedure also fetches data from the table.
My intention is to clean up a database table.I want to delete all rows which was never used( retrieved via select query) in last 5 years.
Does oracle DB have any inbuild meta data which can give me this information.
My alternative solution was to add a column LAST_ACCESSED and update this column whenever I select a row from this table. But this operation is a costly operation for me based on time taken for the whole process. Atleast 1000 - 10000 records will be selected from the table for a single operation. Is there any efficient way to do this rather than updating table after reading it. Mine is a multi threaded application. so update such large data set may result in deadlocks or large waiting period for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful, the feature is currently to be licensed under the Advanced Compression Option or In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, neither per individual block level because the overhead would be too heavy (data is generally often and concurrently read, having to keep a counter for each row would quickly become a very costly operation). However, if you have you data partitioned by date, e.g. create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Also Partitioning is an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
I couldn't use any inbuild solutions. i tried below solutions
1)DB audit feature for select statements.
2)adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded. Audit uses up a lot of space and have performance hit. Similary trigger also had performance hit.
Finally i resolved the issue by maintaining a separate table were entries older than 5 years that are still used or selected in a query are inserted. While deleting I cross check this table and avoid deleting entries present in this table.
I have a Table in sql server consisting of 200 million records in two different servers. I need to move this table from Server 1 to Server 2.
Table in server 1 can be a subset or a superset of the table in server 2. Some of the records(around 1 million) in server 1 are updated which I need to update in server 2. So currently I am following this approach :-
1) Use SSIS to move data from server 1 to staging database in server 2.
2) Then compare data in staging with the table in server 2 column by column. If any of the column is different, I update the whole row.
This is taking a lot of time. I tried using hashbytes inorder to compare rows like this:-
HASHBYTES('sha',CONCAT(a.[account_no],a.[transaction_id], ...))
<>
HASHBYTES('sha',CONCAT(b.[account_no],b.[transaction_id], ...))
But this is taking even more time.
Any other approach which can be faster and can save time?
This is a problem that's pretty common.
First - do not try and do the updates directly in SQL - the performance will be terrible, and will bring the database server to its knees.
In context, TS1 will be the table on Server 1, TS2 will be the table on Server 2
Using SSIS - create two steps within the job:
First, find the deleted - scan TS2 by ID, and any TS2 ID that does not exist in TS1, delete it.
Second, scan TS1, and if the ID exists in TS2, you will need to update that record. If memory serves, SSIS can inspect for differences and only update if needed, otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, then insert the record.
I can't speak to performance on this due to variations in schemas as servers, but it will be compute intensive to analyze the 200mm records. It WILL take a long time.
For on-going execution, you will need to add a "last modified date" timestamp to each record and a trigger to update the field on any legitimate change. Then use that to filter out your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last modified date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index to that column to aid in the filtering.
The other option is to perform a burn and load each time - disable any constraints around TS2, truncate TS2 and copy the data into TS2 from TS1, finally reenabling the constraints and rebuild any indexes.
Best of luck to you.
description
I use Postgres together with python3
There are 17 million rows in the table, the max ID 3000 million+
My task is select id,link from table where data is null;.And do some codes them Update table set data = %s where id = %s.
I tested a single data update needs 0.1s.
my thoughts
The following is my idea
Try a new database, I heard radis soon.But i don't know how to do.
In addition,what is the best number of connections?
I used to made 5-6 connections.
Now only two connections, but better.One hour updated 2million data.
If there is any way you can push the calculation of the new value into the database, i.e. issue a single large UPDATE statement like
UPDATE "table"
SET data = [calculation here]
WHERE data IS NULL;
you would be much faster.
But for the rest of this discussion I'll assume that you have to calculate the new values in your code, i.e. run one SELECT to get all the rows where data IS NULL and then issue a lot of UPDATE statements, each targeting a single row.
In that case, there are two ways how you can speed up processing considerable:
Avoid index updates
Updating an index is more expensive than adding a tuple to the table itself (the appropriately so-called heap, onto which it is quick and easy to pile up entries). So by avoiding index updates, you will be much faster.
There are two ways to avoid index updates:
Drop all indexes after selecting the rows to change and before the UPDATEs and recreate them after processing is completed.
This will be a net win if you update enough rows.
Make sure that there is no index on data and that the tables have been created with a fillfactor of less then 50. Then there is room enough in the data pages to write the update into the same page as the original row version, which obviates the need to update the index (this is known as a HOT update).
This is probably not an option for you, since you probably didn't create the table with a fillfactor like that, but I wanted to add it for completeness' sake.
Bundle many updates in a single transaction
By default, each UPDATE will run in its own transaction, which is committed at the end of the statement. However, each COMMIT forces the transaction log (WAL) to be written out to disk, which slows down processing considerably.
You do that by explicitly issuing a BEGIN before the first UPDATE and a COMMIT after the last one. That will also make the whole operation atomic, so that all changes are undone automatically if processing is interrupted.
I am using SQL CDC to track changes for multiple tables in SQL Server. I would want to report out these changes in right sequence for each I have a program which collects the data from each CDC table. But I want to make sure that all the changes that are happening to these tables are reported in correct sequence. Can I rely on LSN for the right sequence?
The LSN number is unique for a given transaction but is not globally unique. If you have multiple records within the same transaction they will all share the same __$start_lsn value in cdc. If you want the correct order of operations you need to sort by __$start_lsn, __$seqval, then __$operation. The __$seqval represents the id of the individual operation within the wrapping transaction.
For example, I have a table in the dbo schema named foo. It has one column y. If I run this statement:
INSERT INTO dbo.foo VALUES (1);
INSERT INTO dbo.foo VALUES (2);
Then I will see two separate LSN values in cdc because these are in two separate transactions. If I run this:
BEGIN TRAN
INSERT INTO dbo.foo VALUES (1);
INSERT INTO dbo.foo VALUES (2);
COMMIT TRAN
Then I will see one LSN value for both records, but they will have different __$seqval values, and the seqval for my first record will be less than the seqval for my second record.
LSN is unique, ever increasing within the database, across all tables in that database.
In most cases LSN value is unique across all tables, however I found instances where one single LSN value belongs to the changes in 40 tables. I don't know the SQL script that associated with those changes, but I know that all operations were 'INSERT'.
Not sure if it is a bug. CDC documentations is poor, covers just basics. Not many users know that CDC capture process has many bugs confirmed by MS for both SQL 2014 & 2016 (we have the open case).
So I would not rely on the documentation. It may be wrong in some scenarios. It's better to implement more checks and test it with large volume of different combinations of changes.
I also encountered that scenario. In my experience and what I understood is in your first example, there are 2 transactions happened so you will really get 2 different LSN. While in your second example, you only have 1 transaction with 2 queries inside. The CDC will count it as only 1 transaction since it is inside BEGIN and END TRAN. I can't provide links to you since this is my personal experience.