Synchronize data from Oracle to PostgreSQL

We would like to synchronize data (insert, update) from Oracle (11g) to PostgreSQL (10). Our approach was the following:
A trigger on the table in Oracle updates a column with nextval from a sequence before insert and update.
PostgreSQL knows the last sequence number processed and fetches the rows from Oracle whose sequence number is greater than lastSequenceNumberFetched.
We now have the following problem:
Session 1 in Oracle inserts a row; the sequence number (let's say 45) is written, but no COMMIT is done in Oracle.
Session 2 in Oracle inserts a row; the sequence number is written (let's say 49, because sequences in Oracle can have gaps) and a COMMIT is done in Oracle.
The session in PostgreSQL fetches rows from Oracle with sequenceNumber > 44 (because lastSequenceNumberFetched is 44) and gets the row with sequenceNumber 49, so 49 becomes the new lastSequenceNumberFetched.
Session 1 in Oracle commits.
The session in PostgreSQL fetches rows from Oracle with sequenceNumber > 49. The problem is that the row with sequenceNumber 45 is never fetched.
Are there any better approaches for our use case avoiding our problem with missing data?
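For reference, the approach described above boils down to something like the following sketch (the table, column and sequence names here are hypothetical):
-- Oracle side: stamp every inserted/updated row with the next sequence value.
CREATE SEQUENCE sync_seq;
CREATE OR REPLACE TRIGGER trg_orders_sync
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  :NEW.sync_seq_no := sync_seq.NEXTVAL;
END;
/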

In case you don't have delete operations in your tables and the tables are not very big, I suggest using the Oracle System Change Number (SCN) at the row level, which is returned by the pseudocolumn ORA_ROWSCN. This is essentially the commit SCN represented as a number. By default the SCN is tracked at the data block level, but you can enable tracking at the row level (keyword ROWDEPENDENCIES), so you have to recreate your table with this keyword. At the start of the sync procedure you get the current SCN with a call to dbms_flashback.get_system_change_number, then scan all tables where ora_rowscn is between _last_scn_value_ and _current_scn_value_. The disadvantage is that this pseudocolumn is not indexed, so you will get full table scans, which is slow for big tables.
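A minimal sketch of that approach, assuming a hypothetical table named orders:
-- The table must be (re)created with ROWDEPENDENCIES so the SCN is tracked per row.
CREATE TABLE orders (
  id   NUMBER PRIMARY KEY,
  data VARCHAR2(100)
) ROWDEPENDENCIES;
-- At the start of a sync run, capture the current SCN ...
SELECT dbms_flashback.get_system_change_number FROM dual;
-- ... then fetch everything committed since the previous run.
SELECT id, data
FROM   orders
WHERE  ora_rowscn > :last_scn_value
AND    ora_rowscn <= :current_scn_value;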
If you use delete statements, then you have to track the records that were deleted. For this purpose you can use one log table with the following columns: table_name, table_id_value, operation (insert/update/delete). The table is filled by triggers on the base tables. So for your case, when session 1 commits data in the base table, you get a record in the log table to process, and you don't see it until the session commits. So there are no issues with sequence numbers like the ones you described.
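A minimal sketch of that log table and trigger, again with hypothetical table and column names:
CREATE SEQUENCE sync_log_seq;
CREATE TABLE sync_log (
  log_id         NUMBER PRIMARY KEY,
  table_name     VARCHAR2(30),
  table_id_value NUMBER,
  operation      VARCHAR2(10)   -- 'INSERT', 'UPDATE' or 'DELETE'
);
CREATE OR REPLACE TRIGGER trg_orders_log
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO sync_log (log_id, table_name, table_id_value, operation)
  VALUES (sync_log_seq.NEXTVAL, 'ORDERS', COALESCE(:NEW.id, :OLD.id),
          CASE WHEN INSERTING THEN 'INSERT'
               WHEN UPDATING  THEN 'UPDATE'
               ELSE 'DELETE' END);
END;
/
-- The PostgreSQL side only ever sees committed log rows, which avoids the
-- gap problem described in the question.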
Hope that helps.

Is this purely a data project, or do you have a client application here? If you have a middle tier, you could use an ORM to abstract some of this and write to both databases. Do you care whether the sequences are the same? It would also be possible to do something like collect all the data to synchronize since a particular timestamp (every table would have to have a UTC timestamp), then take a hash of all that data and compare it with what is in Postgres.
It might be useful to have some more of your requirements for the synchronization of data and the reasoning behind this e.g.
Do the keys need to be the same in both environments? Why?
Who views the data? Is the same consumer looking at both sources?
Why wouldn't you just use an ORM to target only one database? Why do you need both Oracle and Postgres?

I have seen a similar setup: an application on Postgres, mostly for reporting and other secondary tasks, while the main app was on Oracle.
Some of the main app's tables are cached in Postgres for convenience, but this setup brings in the sync problem.
The compromise solution was a mix of incremental sequence-based sync during the daytime and a full table copy overnight.
Regarding other solutions proposed here:
Postgres FDW is slow for complex queries and puts extra load on the foreign database, especially when the WHERE clause refers to both local and foreign tables.
The same query will run much faster if the foreign table is cached in Postgres (see the sketch below).
Incremental/differential sync using sequence numbers: I tried this and it works acceptably for small tables, but the nightmare starts with child relations; maybe an ORM can help here.
The ideal solution, in my opinion, would probably be to stream Oracle changes to Postgres, or to an intermediary process that replicates the changes to Postgres.
I have no clue how to do this; as I understand it, it requires Oracle GoldenGate (plus a licence).
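To illustrate the caching point above: with the oracle_fdw extension, a foreign table can be materialized locally so that joins with local tables no longer hit Oracle on every query. This is only a sketch; all object names, credentials and connection strings below are assumptions.
CREATE EXTENSION oracle_fdw;
CREATE SERVER oracle_src FOREIGN DATA WRAPPER oracle_fdw
  OPTIONS (dbserver '//oracle-host:1521/ORCL');
CREATE USER MAPPING FOR CURRENT_USER SERVER oracle_src
  OPTIONS (user 'app_user', password 'app_password');
CREATE FOREIGN TABLE orders_remote (
  id         numeric,
  updated_at timestamp
) SERVER oracle_src OPTIONS (schema 'APP', table 'ORDERS');
-- Cache the foreign table locally; refresh it on a schedule (e.g. nightly).
CREATE MATERIALIZED VIEW orders_cached AS SELECT * FROM orders_remote;
REFRESH MATERIALIZED VIEW orders_cached;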

Related

Oracle database table delete best practices

Environment: Oracle 12C
Got a table with about 10 columns, which include a few CLOB and DATE columns. This is a very busy table for an ETL process, as described below:
Flat files are loaded into the table first, then updated and processed. The insert and updates happen in batches. Millions of records are inserted and updated.
There is also a delete process that removes old data based on a date field. The delete process runs as a PL/SQL procedure and deletes from the table in a loop, fetching only the first n records at a time based on the date field.
I do not want the delete process to interfere with the regular inserts/updates. What is the best practice for coding the delete so that it has minimal impact on the regular insert/update process?
I can also partition the table and delete in parallel since each partition uses its own rollback segment but am looking for a simpler way to tune the delete process.
Any suggestions on using a special rollback segment or other tuning tips ?
The first thing you should look at is decoupling the various ETL processes so that you don't need to run them all together or in a particular sequence, thereby removing the dependency between the INSERTs/UPDATEs and the DELETEs. While you can handle the insert/update in a single MERGE block in your ETL, you can do the delete later by simply marking the rows to be deleted, i.e. a soft delete. You could do this with a flag column in your table, and use that flag in your application and queries to filter those rows out.
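A minimal sketch of the soft-delete flag, with hypothetical table and column names:
ALTER TABLE etl_data ADD (to_delete CHAR(1) DEFAULT 'N' NOT NULL);
-- The ETL marks old rows instead of deleting them inline.
UPDATE etl_data
SET    to_delete = 'Y'
WHERE  load_date < SYSDATE - 90;
-- Queries and the application filter them out.
SELECT * FROM etl_data WHERE to_delete = 'N';
-- An off-peak job purges marked rows in batches.
DELETE FROM etl_data WHERE to_delete = 'Y' AND ROWNUM <= 10000;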
By doing the delete later, the critical path of your ETL should shrink. Partitioning the data by date range should definitely help you maintain the data and also keep the transactions efficient if the processing is date driven. Also, look for any row-by-row (and therefore slow-by-slow) processing and convert it to bulk operations. Avoid context switching between SQL and PL/SQL as much as possible.
If you partition the table by date range, then you can look into DROP/TRUNCATE PARTITION, which discards the rows stored in a partition as a DDL statement. This cannot be rolled back; it executes quickly and uses few system resources (undo and redo). You can read more about it in the documentation.
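For example, a date-range-partitioned layout (names and boundaries are hypothetical) lets old data be discarded as DDL instead of row-by-row DELETEs:
CREATE TABLE etl_data_part (
  id        NUMBER,
  payload   CLOB,
  load_date DATE
)
PARTITION BY RANGE (load_date) (
  PARTITION p_2018_q1 VALUES LESS THAN (DATE '2018-04-01'),
  PARTITION p_2018_q2 VALUES LESS THAN (DATE '2018-07-01'),
  PARTITION p_future  VALUES LESS THAN (MAXVALUE)
);
-- Removing a whole quarter is a quick DDL operation that generates little
-- undo/redo; it cannot be rolled back.
ALTER TABLE etl_data_part DROP PARTITION p_2018_q1 UPDATE GLOBAL INDEXES;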

MERGE vs DROP Table and Rebuild Indexes in SQL Server

I have a "log shipped" copy of a database that lives at a third party. Log shipping runs every 15 minutes at which time all connections to the database are dropped. The database is used for reporting purposes.
I have decided to pull some of the data from the log-shipped (read-only) database into a new database refreshed nightly. This will allow users to connect to the new database without risk of losing connectivity due to log shipping. (It also allows more granular security permissions to be used, since the read-only copy can't be edited.)
I can think of 2 patterns to accomplish this.
1. Drop table, create table, create indexes
2. Use the MERGE statement to insert/update/delete records
I have implemented the solution using method 1 above, and it works just fine.
It feels a bit heavy to me to drop all of the data every day. Are there any side effects to method 1 above that should push me over to using method 2?
To provide a sense of scale, I am only syncing 3 tables,
Table 1 - 38 columns - 13,110 rows
Table 2 - 82 columns - 17,421 rows
Table 3 - 22 columns - 249 rows
The resulting database is ~1.3 GB. (There are some other tables in there as well)
I would appreciate guidance on Method 1 vs 2, and whether there is another method that I am not thinking about.
TRUNCATE and INSERT would be more efficient than either dropping or merging.
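A minimal sketch of that nightly refresh, assuming hypothetical database, table and column names:
BEGIN TRANSACTION;
-- TRUNCATE keeps the table and index definitions and is minimally logged.
TRUNCATE TABLE dbo.ReportTable1;
INSERT INTO dbo.ReportTable1 (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM   LogShippedDb.dbo.Table1;   -- the read-only, log-shipped source
COMMIT TRANSACTION;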

Alternative Method to Polling/Trigger a Table in Oracle?

I have a database on Oracle 11g with a table that is updated by external users. I want to catch the inserts/updates/deletes on this table in order to bring these changes to a table in another database, and I'm trying different methods as research. So far I have tested polling (a job that checks every minute whether there has been an update, insert or delete on the table) and triggers (firing on update, insert or delete on the table); are there alternative methods?
I found AQ (Oracle Advanced Queuing), DBMS_PIPE and the Oracle SNMP Agent Integrator polling activity, but I don't know if they are right for this case.
It depends.
Polling or triggers are often all you need depending on the volume of data involved, and the frequency of inserts/updates/deletes.
For example, the polling method might be as simple as adding a column which is set to 1 by default, and updated to NULL when the row is "consumed" by the replication code. A trigger on the table would set it back to 1 if a row is updated. An index on this column would be lightweight (the index would only include entries for rows where the column is 1) and therefore fast to query. You'd need another table to handle deletes, though.
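A rough sketch of that flag-column pattern (all names are hypothetical):
ALTER TABLE src_table ADD (needs_sync NUMBER(1) DEFAULT 1);
-- Oracle B-tree indexes do not store entirely NULL keys, so this index
-- only contains the not-yet-consumed rows and stays small.
CREATE INDEX idx_src_needs_sync ON src_table (needs_sync);
CREATE OR REPLACE TRIGGER trg_src_mark_dirty
BEFORE UPDATE ON src_table
FOR EACH ROW
BEGIN
  :NEW.needs_sync := 1;
END;
/
-- The replication code polls WHERE needs_sync = 1 and then sets the flag
-- back to NULL for the rows it has consumed.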
The trigger method would merely write insert/update/delete rows into a log table of some sort, which would then get purged periodically by a job.
For heavier volumes, solutions include Oracle GoldenGate and Oracle Streams: http://www.oracle.com/technetwork/database/focus-areas/data-integration/index.html

Large Data Service Architecture

Every day a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a Windows service that runs early in the AM to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table, which can be costly if it's tuned for selects. I believe that with MERGE it's a bulk action.
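A hedged sketch of that staging-table-plus-MERGE pattern (MERGE requires SQL Server 2008 or later, as the update below notes; all names here are hypothetical):
MERGE dbo.MainTable AS target
USING #Staging AS source
      ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Val1 = source.Val1,
               target.Val2 = source.Val2
WHEN NOT MATCHED BY TARGET THEN
    -- Staging rows with a NULL Id never match, so they become new records.
    INSERT (Val1, Val2)
    VALUES (source.Val1, source.Val2);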
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot use MERGE in SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in that instance); however, it wasn't merged back in. I don't know what merging capabilities are available in SQL replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, it will allow you to solve the issue that is stopping you from bulk inserting (i.e. you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively, with the temporary table idea, you can add a WHERE condition to first check whether the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT tmp.val1, tmp.val2, tmp.val3
FROM #Tempo tmp
WHERE NOT EXISTS
(
    SELECT *
    FROM MyTable t
    -- qualify both sides; an unqualified val1 here would resolve to MyTable's own column
    WHERE t.val1 = tmp.val1 AND t.val2 = tmp.val2 AND t.val3 = tmp.val3
)
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory, if you want, before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.

Bulkcopy inserts with DBCC CheckIdent

Our team needs to insert a huge amount of data into our SQL Server 2008 database, and we're looking for a good solution. We came up with one, but I have doubts about it, simply because it doesn't feel right. So I'm asking here whether this seems like a good solution. An extra challenge is that it's a peer-to-peer replicated database over 4 servers! :)
Imagine we have 1 million rows to insert
Start transaction
Increase the current identity value on the table by 1 million (via DBCC CHECKIDENT)
Have a DataSet/DataTable ready with 1 million rows and the correct ids
BulkCopy the data into the database
Commit transaction
Is this a good solution? Might we run into concurrency issues, too-large transactions, etc.?
You'll only get problems (as far as I can see, so there might be things I've overlooked!) if the database is online and users can insert rows into that table. Increasing the identity value for new rows at the meta level simply means that the next row inserted by the system will use that number, so if you bump it by 1 million, you have reserved those numbers up front.
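A rough sketch of that reservation step (the table name is hypothetical; the DECLARE-with-initializer syntax assumes SQL Server 2008, which the question uses):
-- Read the current identity value and push the seed forward by 1 million,
-- reserving that range for the rows the client is about to bulk copy in.
DECLARE @newseed BIGINT = IDENT_CURRENT('dbo.TargetTable') + 1000000;
DBCC CHECKIDENT ('dbo.TargetTable', RESEED, @newseed);
-- The client then assigns the reserved ids to its DataTable and writes them
-- with SqlBulkCopy, keeping the identity values it generated.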
Identity columns are 'nice' but have the side effect that they're not transferable. So if you have to migrate the data to another DB, realize that you will likely have to adjust the inserted data to match the database you insert it into (as that database is the scope of the identity values, which means identity fields could collide with rows already in the table).
If this is a one-time affair, it might work out. If you're planning to do this regularly, I'd look into a higher-level migration system where you migrate the data to new identity values, or use GUIDs with NEWSEQUENTIALID() so you get well-behaved indexes and also unique, transferable IDs.
