Once a day I have to synchronize a table between two databases.
Source: Microsoft SQL Server
Destination: PostgreSQL
The table contains up to 30 million rows.
The first time I will copy the whole table, but after that, for efficiency, my plan is to insert/update only the changed rows.
With this approach, if I delete a row from the source database, it will not be deleted from the destination database.
The problem is that I don’t know which rows were deleted from the source database.
My rough idea right now is to use binary search: compare the number of rows in a key range on each side, and keep halving the range until the deleted rows are found.
I’m at a dead end - please share your thoughts on this...
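A minimal sketch of the range-bisection idea, assuming the table has an integer key and that the check runs after the insert/update step, so every source key already exists in the destination. Two sorted Python lists stand in for the tables; in practice `count_in_range` would be a `SELECT COUNT(*) ... WHERE id >= lo AND id < hi` against each database. All function names here are hypothetical.

```python
from bisect import bisect_left


def count_in_range(sorted_keys, lo, hi):
    """Count keys k with lo <= k < hi (stand-in for a COUNT(*) range query)."""
    return bisect_left(sorted_keys, hi) - bisect_left(sorted_keys, lo)


def keys_in_range(sorted_keys, lo, hi):
    """Return the keys k with lo <= k < hi (stand-in for SELECT id ... range query)."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_left(sorted_keys, hi)]


def find_deleted(source_keys, dest_keys, lo, hi, leaf_size=4):
    """Return keys in [lo, hi) that exist in the destination but not in the source.

    Assumes dest_keys is a superset of source_keys, so equal counts in a
    range mean the range is identical and can be skipped.
    """
    if count_in_range(source_keys, lo, hi) == count_in_range(dest_keys, lo, hi):
        return []
    if hi - lo <= leaf_size:
        # Range is small enough: fetch the keys and compare them directly.
        present = set(keys_in_range(source_keys, lo, hi))
        return [k for k in keys_in_range(dest_keys, lo, hi) if k not in present]
    mid = (lo + hi) // 2
    return (find_deleted(source_keys, dest_keys, lo, mid, leaf_size)
            + find_deleted(source_keys, dest_keys, mid, hi, leaf_size))


source = [1, 2, 3, 5, 8, 9, 10]
dest = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
deleted = find_deleted(source, dest, 1, 11)
print(deleted)
```

The number of count queries grows with the number of deleted rows rather than with the table size, which is why this can pay off on a 30-million-row table when deletions are rare.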
In SQL Server you can enable Change Tracking to track which rows were inserted, updated, or deleted since the last time you synchronized the tables.
With a TDS FDW (foreign data wrapper), map the source table into PostgreSQL (optionally copying it into a temp table), and use a join to find/exclude the rows that you need.
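The join mentioned above is an anti-join: destination rows with no matching key in the source. A small sketch using Python's built-in `sqlite3`, where `source_rows` is only a stand-in for the foreign (or temp) table and both table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_rows (id INTEGER PRIMARY KEY, payload TEXT);  -- stand-in for the mapped source table
    CREATE TABLE dest_rows   (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO source_rows VALUES (1, 'a'), (3, 'c');
    INSERT INTO dest_rows   VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
""")

# Anti-join: destination rows whose key no longer exists in the source.
orphans = [row[0] for row in conn.execute("""
    SELECT d.id
    FROM dest_rows AS d
    LEFT JOIN source_rows AS s ON s.id = d.id
    WHERE s.id IS NULL
    ORDER BY d.id
""")]

# The same condition expressed as a delete.
conn.execute("""
    DELETE FROM dest_rows
    WHERE NOT EXISTS (SELECT 1 FROM source_rows AS s WHERE s.id = dest_rows.id)
""")
remaining = [row[0] for row in conn.execute("SELECT id FROM dest_rows ORDER BY id")]
print(orphans, remaining)
```

Only the key column has to be compared for this, so pulling just the keys across keeps the transfer small.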
I have gone through a bunch of documentation for PostgreSQL 10 partitioning, but I am still not clear on whether existing tables can be partitioned. Most of the posts discuss partitioning existing tables in PostgreSQL 9.
Also, the official PostgreSQL documentation (https://www.postgresql.org/docs/current/static/ddl-partitioning.html) states: 'It is not possible to turn a regular table into a partitioned table or vice versa'.
So, my question is: can existing tables be partitioned in PostgreSQL 10?
If the answer is YES, my plan is:
Create the partitions.
Alter the existing table to include the range so new data goes into the new partition. Once that is done, write a script which loops over the master table and moves the data into the right partitions.
Then, truncate the master table and enforce that nothing can be inserted into it.
If the answer is NO, my plan is to make the existing table the first partition:
Create a new parent table and children (partitions).
Perform a short transaction which renames the existing table to a partition table name and the new parent to the actual table name.
Are there better ways to partition existing tables in PostgreSQL 10/9?
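The "NO" plan above could be sketched roughly as follows, assuming PostgreSQL 10 declarative range partitioning. The helper only builds the statements so they can be reviewed before running; the table name, key column, and bounds are hypothetical, and the string formatting is for illustration only (it does no escaping). Note that `ATTACH PARTITION` normally scans the existing table to validate the bound unless a matching CHECK constraint is already in place.

```python
def partition_swap_statements(table, key, split_bound, next_bound):
    """Build the statements that turn an existing table into the first partition.

    Hypothetical helper: the old table keeps its data and becomes <table>_p0,
    and a new partitioned parent takes over the original name.
    """
    old = f"{table}_p0"
    new = f"{table}_p1"
    return [
        "BEGIN;",
        # Free up the original name for the new parent.
        f"ALTER TABLE {table} RENAME TO {old};",
        # New parent with the same columns, partitioned by range on the key.
        f"CREATE TABLE {table} (LIKE {old} INCLUDING DEFAULTS) "
        f"PARTITION BY RANGE ({key});",
        # Existing data becomes the partition for everything below the split.
        f"ALTER TABLE {table} ATTACH PARTITION {old} "
        f"FOR VALUES FROM (MINVALUE) TO ('{split_bound}');",
        # New rows from the split onward go to a fresh partition.
        f"CREATE TABLE {new} PARTITION OF {table} "
        f"FOR VALUES FROM ('{split_bound}') TO ('{next_bound}');",
        "COMMIT;",
    ]


statements = partition_swap_statements(
    "measurements", "created_at", "2018-01-01", "2019-01-01")
print("\n".join(statements))
```

The upper bound of a range partition is exclusive, so the two partitions above do not overlap at the split value.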
Is there any information on the net where I can verify how high the storage costs are for the temporal tables feature?
Will the server create a full hard copy of the row/tuple that was modified?
Or will the server use references/links to the original values of the master table that are not modified?
For example: I have a row with 10 columns = 100 KB of storage. I change one value of that row two times. After those changes I have two rows in the history table. Is the full storage cost for the master and history tables then ~300 KB?
Thanks for every hint!
Regards
Will the server create a full hard copy of the row/tuple that was
modified? Or will the server use references/links to the original
values of the master table that are not modified?
Here is a quote from the book Pro SQL Server Internals
by Dmitri Korotkevitch that answers your question:
In a nutshell, each temporal table consists of two tables — the
current table with the current data, and a history table that stores
old versions of the rows. Every time you modify or delete data in
the current table, SQL Server adds an original version of those rows
to the history table.
A current table should always have a primary key defined. Moreover,
both current and history tables should have two datetime2 columns,
called period columns, that indicate the lifetime of the row. SQL
Server populates these columns automatically based on transaction
start time when the new versions of the rows were created. When a row
has been modified several times in one transaction, SQL Server does
not preserve uncommitted intermediary row versions in the history
table.
SQL Server places the history tables in a default filegroup, creating
non-unique clustered indexes on the two datetime2 columns that
control row lifetime. It does not create any other indexes on the
table.
In both the Enterprise and Developer Editions, history tables use
page compression by default.
So it does not use "references/links to the original values of the master table".
The previous row version is simply copied as is into the history table on every modification.
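Applied to the 100 KB example from the question, the full-copy behaviour gives the following rough estimate. This is a sketch of an uncompressed upper bound only: it ignores the page compression of history tables mentioned in the quote and the small overhead of the two period columns, and the function name is hypothetical.

```python
def temporal_storage_kb(row_kb, modifications):
    """Uncompressed estimate: one current row plus one full copy per modification."""
    current_kb = row_kb
    history_kb = row_kb * modifications
    return current_kb, history_kb, current_kb + history_kb


current, history, total = temporal_storage_kb(row_kb=100, modifications=2)
print(current, history, total)  # 100 200 300
```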
I have a historical data table like (Date, ItemId, Price). Normally around 60,000 records are inserted into the table at a time. The table now holds around 3 million records. Our query is something like selecting 2,000 products over 3 months, which is currently very slow. I have already created some indexes for it, but I still want better performance.
In this situation, how can I make the query faster? Table partitioning or caching?
Thanks
Please specify the version of SQL Server you are using. Before SQL Server 2016 SP1, partitioning only works with Enterprise edition.
To improve performance, you can use temporary tables, i.e. create a temporary table holding only the subset (rows and columns) of the data you need. The temporary table will be smaller than the original table, and it can also be indexed if required. This subset of data stored in temp tables can also be cached, which improves performance further.
I have a CDC process set up, whereby TableA's new rows (or updates) are automatically picked up by an ETL process and put into TableB:
TableA >>CDC>> TableB
The CDC works fine, except I want to update the first table once the CDC process is finished, by populating it with the "extraction date". So my TableA has, let's say: Name, Age, OtherInfo, ExtractionDate. CDC is set up on the Name, Age and OtherInfo columns (the ExtractionDate column is excluded for obvious reasons).
Then, once CDC has run on TableA and the changes have been moved to TableB, I'd like to populate TableA's ExtractionDate with the current date. However, since I do not know which rows are being moved, I am having difficulty populating the column. Specifically, how can I write a selective WHERE clause to select the changed rows, when that's only known to SSIS?
In the TableA database there are system tables that were created as part of enabling CDC. You should be able to easily find the table associated with TableA. This is where SQL Server keeps track of all the changes.
The __$start_lsn column holds the log sequence number (LSN) of the transaction that made the change, and your SSIS imports use this value to bring across a range of changes. The lsn_time_mapping table lets you look up the time for an LSN, so it is easier to understand.
In my processing I store the start and end lsn values so I know what was brought across with each SSIS run. I could then use these lsn values to go back to this CDC source table and see all the changes that MSSQL has tracked during that time-span.
Keep in mind that the CDC system tables are cleaned up automatically (by default, changes are kept for three days), so you wouldn't be able to apply this logic historically, only for recent imports.
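A simplified model of that approach, using Python's built-in `sqlite3` instead of SQL Server: the LSN window stored for one SSIS run selects the changed keys, and those rows get the extraction date. The `Id` key column, the `TableA_CT` layout, and the integer LSNs are assumptions for illustration; the real change table is `cdc.<capture_instance>_CT` and its `__$start_lsn` is `binary(10)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER,
                         OtherInfo TEXT, ExtractionDate TEXT);
    -- Simplified stand-in for the CDC change table of TableA.
    CREATE TABLE TableA_CT (start_lsn INTEGER, Id INTEGER);
    INSERT INTO TableA (Id, Name, Age, OtherInfo) VALUES
        (1, 'Ann', 30, 'x'), (2, 'Bob', 41, 'y'), (3, 'Cy', 25, 'z');
    INSERT INTO TableA_CT VALUES (101, 1), (105, 3), (110, 2);
""")


def stamp_extracted_rows(conn, from_lsn, to_lsn, extraction_date):
    """Set ExtractionDate on rows whose changes fall inside one run's LSN window."""
    cur = conn.execute(
        """
        UPDATE TableA
        SET ExtractionDate = ?
        WHERE Id IN (SELECT Id FROM TableA_CT
                     WHERE start_lsn BETWEEN ? AND ?)
        """,
        (extraction_date, from_lsn, to_lsn),
    )
    return cur.rowcount


# The run brought across LSNs 101..105 (values stored by the ETL process).
updated = stamp_extracted_rows(conn, 101, 105, "2024-01-15")
stamped = conn.execute(
    "SELECT Id, ExtractionDate FROM TableA ORDER BY Id").fetchall()
print(updated, stamped)
```

The row changed at LSN 110 is left untouched, because it belongs to a later run.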
I have a database with 51 tables all with the same schema (one table per state). Each table has a couple million rows and about 50 columns.
I've normalized the columns into 6 other tables, and now I want to import all of the data from those 51 tables into the 6 new tables. The column names are all the same, and so I'm hoping I can automate the process of importing all the data.
I'm assuming what I'll need to do is:
Select the names of all the tables that are in the raw schema
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'raw'
Iterate over all the results
Grab all rows from that table, and SELECT INTO the appropriate cols into the appropriate tables
Delete the rows from the raw table
Is there anything I'm missing? Also, is there any way to have this run on the SQL Server so I don't have to keep SQL Server Management Studio open the whole time?
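The steps above can be sketched as a loop. This uses Python's built-in `sqlite3` as a stand-in for SQL Server, with a `raw_` name prefix standing in for the raw schema; the two target tables and all column names are hypothetical. Because the normalized tables already exist, the copy uses `INSERT ... SELECT` (in T-SQL, `SELECT INTO` creates a new table).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_AL (person TEXT, city TEXT);
    CREATE TABLE raw_AK (person TEXT, city TEXT);
    CREATE TABLE people (person TEXT, state TEXT);  -- hypothetical normalized targets
    CREATE TABLE cities (city TEXT, state TEXT);
    INSERT INTO raw_AL VALUES ('Ann', 'Mobile'), ('Bob', 'Mobile');
    INSERT INTO raw_AK VALUES ('Cy', 'Nome');
""")

# Step 1: list the raw tables (stand-in for the INFORMATION_SCHEMA.TABLES query).
raw_tables = [
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    if name.startswith("raw_")
]

# Steps 2-4: per table, copy into the targets and clear the source together.
for table in raw_tables:
    state = table[len("raw_"):]
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            f'INSERT INTO people (person, state) SELECT person, ? FROM "{table}"',
            (state,))
        conn.execute(
            f'INSERT INTO cities (city, state) SELECT DISTINCT city, ? FROM "{table}"',
            (state,))
        conn.execute(f'DELETE FROM "{table}"')

people_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
cities = conn.execute("SELECT city, state FROM cities ORDER BY state").fetchall()
left_in_raw = conn.execute("SELECT COUNT(*) FROM raw_AL").fetchone()[0]
print(raw_tables, people_count, cities, left_in_raw)
```

Doing the inserts and the delete for one table in a single transaction means a failure partway through leaves that raw table intact, so the loop can simply be rerun.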
Yes, obviously, you can automate it with T-SQL. But I recommend using SSIS in this case. As you say, the structure of all the tables is the same, so you can build one ETL process and then just change the table name in the source. Consequently, you will have the following advantages:
Solve the issue with a couple of clicks
Low risk of errors
You will be able to use a number of data transformations