How do various databases implement copying data (replication) to a new instance when it is added to the replication setup?
I.e., when we add a new instance, how is the data loaded into it?
There is a lot of information about replication methods, but it is explained for cases where the target database instance already has the same data as its source, not for the case of a new, initially empty database instance.
There are basically 3 approaches here.
First, you start capturing the changes from the source database using a CDC (change data capture) tool. Since the target database is not yet created, you store all the changes so you can apply them later.
Depending on the architecture, you can do the following.
If you need a 1:1 copy
Take a backup of the source database and restore it to the target database. Since the backup represents a known point in time, you then start applying the changes from the timestamp at which the backup was created.
Assuming the backup is consistent, you will have the same data on the target, just delayed relative to the source.
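For illustration, on SQL Server the backup/restore part could look roughly like this (database names, logical file names and the share path are all made up for the example; other engines have their own equivalents):

BACKUP DATABASE SourceDb
    TO DISK = N'\\share\backups\SourceDb_full.bak'
    WITH COPY_ONLY, CHECKSUM;

-- On the target server; the logical file names ('SourceDb', 'SourceDb_log') are assumptions.
RESTORE DATABASE TargetDb
    FROM DISK = N'\\share\backups\SourceDb_full.bak'
    WITH MOVE 'SourceDb'     TO N'D:\data\TargetDb.mdf',
         MOVE 'SourceDb_log' TO N'D:\data\TargetDb_log.ldf';

-- The CDC tool is then told to start applying changes from the point in time (or LSN) of the backup.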
If you have a subset of the tables or a different vendor
The same approach as in 1, except you don't back up and restore the full database, just a list of tables. You can also restore the database backup to a temporary location, export some of the tables (or only a subset of their columns), and then load them into the target.
Once the target is initially prepared, you start applying the changes from the source to the target.
No source database snapshot available
If you can't get a snapshot, the replication tool often provides a method to deal with that. Depending on the tool, the feature is called AUTOCORRECTION (SAP/Sybase Replication Server) or HANDLECOLLISIONS (Oracle GoldenGate). It basically means that the replication tool has the full image of each UPDATE operation: when the record does not exist in the target, it is created; when the row for a DELETE does not exist, the operation is ignored; and when the row already exists for an INSERT, the operation is ignored.
To get a consistent state on the target, you work in this mode for some time, until the data is in sync, and then switch to regular replication.
One thing to mention about this mode: you need to make sure that during the reconciliation phase the CDC stream provides the full UPDATE content for each row. If an UPDATE contains only the modified columns, you will not be able to build an INSERT command (with all column values) when the row is missing.
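As an illustration of why the full row image matters: with all column values in hand, a captured UPDATE can be replayed as an upsert, for example with a standard MERGE (the table and column names below are hypothetical):

-- Hypothetical example: a captured UPDATE of customers (id, name, email)
-- replayed so that a missing target row is simply created.
MERGE INTO customers AS tgt
USING (SELECT 42 AS id, 'Alice' AS name, 'alice@example.com' AS email) AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.name = src.name, tgt.email = src.email
WHEN NOT MATCHED THEN
    INSERT (id, name, email) VALUES (src.id, src.name, src.email);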
Of course, the replication tool you use may incorporate the solution described above and do this task for you automatically.
I have an application that is in production with its own database for more than 10 years.
I'm currently developing a new application (kind of a reporting application) that only needs read access to the database.
In order not to be too tightly coupled to that database, and to be able to use a newer DAL (Entity Framework 6 Code First), I decided to start from a new, empty database to which I only added the tables and columns I need (with different names than the production ones).
Now I need some way to update the new database from the production database regularly (ideally it would be almost immediate).
I hesitated to ask this question on http://dba.stackexchange.com, but I'm not necessarily limited to using SQL Server for the job (I can develop and run a custom application if needed).
I have already done some searching and found these (partial) solutions:
Using transactional replication to create a smaller database (with only the tables/columns I need). But as far as I can see, the fact that I have different table/column names will be problematic. So I could use it to create a smaller database that is automatically replicated by SQL Server, but I would still need to replicate that database to my new one (though it might keep my production database from being stressed too much?).
Using triggers to insert/update/delete the rows
Creating a custom job (either a SQL Agent job or a Windows Service that runs every X minutes) that updates the necessary tables (I have a LastEditDate column that is updated by a trigger on my tables, so I can tell whether a row has been updated since my last replication).
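What I have in mind for the custom job is, per table, roughly something like the following (all the names are placeholders for illustration, and I'd use a linked server to reach production):

-- Rough sketch for one table, assuming a linked server named PROD and a small
-- watermark table on the reporting side; every name here is a placeholder.
DECLARE @since datetime =
    (SELECT LastSyncDate FROM dbo.SyncWatermark WHERE TableName = 'Orders');

MERGE INTO dbo.ReportOrders AS tgt
USING (SELECT Id, CustomerId, Total, LastEditDate
       FROM PROD.ProdDb.dbo.Orders
       WHERE LastEditDate > @since) AS src
    ON tgt.OrderId = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerId = src.CustomerId,
               tgt.Total      = src.Total
WHEN NOT MATCHED THEN
    INSERT (OrderId, CustomerId, Total)
    VALUES (src.Id, src.CustomerId, src.Total);

-- Watermark handling is simplified here; in practice I'd store the MAX(LastEditDate)
-- actually read from the source to avoid clock-skew gaps.
UPDATE dbo.SyncWatermark
SET LastSyncDate = SYSDATETIME()
WHERE TableName = 'Orders';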
Do you have any advice, or maybe other solutions that I didn't foresee?
Thanks
I think that transactional replication is better than using triggers.
Too many resources would be used on the source server/database, because a trigger fires for every DML transaction.
Transactional replication can be scheduled as a SQL job and run a few times a day/night, or as part of a nightly scheduled job. It really depends on how busy the source db is...
There is one more thing you could try - database mirroring. It depends on your SQL Server version.
If it were me, I'd use transactional replication but keep the table/column names the same. If you have a real reason why they need to change (I honestly can't think of any good ones and a lot of bad ones), wrap each table in a view. At least that way, the view documents where the data is coming from.
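For example, a view like this (hypothetical names) would let the reporting application keep the names it expects while the replicated table keeps its production name:

CREATE VIEW dbo.ReportCustomer
AS
SELECT Id        AS CustomerId,
       CustName  AS Name,
       CustEmail AS Email
FROM dbo.Customers;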
I'm gonna throw this out there and say that I'd use transaction log shipping. You can even set the secondary databases to read-only. There would be some setting up for the full recovery model and transaction log backups, but that way you can automatically restore the transaction logs to the secondary database, be hands-off with it, and the secondary database will be as current as your last transaction log backup.
Depending on how current the data needs to be, if a daily refresh is enough you can set up something that takes your daily backups and simply restores them to the secondary.
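A minimal sketch of what happens under the hood, with made-up names and paths (the log shipping wizard or a SQL Agent job automates exactly these steps): back up the log on the primary, then restore it on the secondary with STANDBY so it stays readable between restores:

-- On the primary (FULL recovery model):
BACKUP LOG ProdDb TO DISK = N'\\share\logs\ProdDb_001.trn';

-- On the secondary, which was initially restored WITH NORECOVERY or STANDBY:
RESTORE LOG ReportDb
    FROM DISK = N'\\share\logs\ProdDb_001.trn'
    WITH STANDBY = N'D:\data\ReportDb_undo.bak';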
In the end, we went with the trigger solution. We don't have that many changes per day (maybe 500, 1000 tops), and it didn't put too much pressure on the current database. Thanks for your advice.
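For anyone curious, the triggers are essentially of this shape (heavily simplified, all names made up, and it assumes both databases live on the same instance):

CREATE TRIGGER trg_Orders_Sync ON dbo.Orders
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Push the changed rows into the reporting database (deletes are handled by a similar trigger).
    MERGE INTO ReportDb.dbo.ReportOrders AS tgt
    USING inserted AS src
        ON tgt.OrderId = src.Id
    WHEN MATCHED THEN
        UPDATE SET tgt.Total = src.Total
    WHEN NOT MATCHED THEN
        INSERT (OrderId, Total) VALUES (src.Id, src.Total);
END;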
I've a bit of a strange problem with a BACPAC I took last night using the SQL Azure Import/Export Service.
In our database there are 2 related tables.
dbo.Documents -- All documents in the database
    Id
    DocName
    Extension

dbo.ProcessDocuments -- Documents specific to a process
    Id
    DocumentId (FK -> dbo.Documents.Id with check constraint)
    ProcessId
Based on that Schema it should not be possible for the ProcessDocuments table to include a row that does not have a companion entry in the main Documents table.
However, after I restored the database in another environment, I ended up with 7001 entries in ProcessDocuments but only 7000 corresponding entries in Documents (1 missing), and the restore failed when attempting to apply the ALTER TABLE ... CHECK CONSTRAINT on ProcessDocuments.
The only thing I can imagine is that when the backup was being taken, it went through the tables sequentially (alphabetically?), backing up the data one table at a time, and something like the following happened:
Documents gets backed up. Contains 7000 entries
Someone adds a new process document to the system / inserts into Documents & ProcessDocuments
ProcessDocuments gets backed up. Contains 7001 entries
If that's the case, then it creates a massive problem in terms of using BACPACs as a valid disaster recovery asset, because if they're taken while the system has data in motion, it's possible that your BACPAC contains data integrity issues.
Is this the case, or can anyone shed any light on what else could have caused this?
Data export uses bulk operations on the DB and is NOT guaranteed to be transactional, so issues like the one you described can and eventually will happen.
"An export operation performs an individual bulk copy of the data from each table in the database so does not guarantee the transactional consistency of the data. You can use the Windows Azure SQL Database copy database feature to make a consistent copy of a database, and perform the export from the copy."
http://msdn.microsoft.com/en-us/library/windowsazure/hh335292.aspx
If you want to create transactionally consistent backups, you have to copy the DB first (which may cost you a lot, depending on the size of your DB) and then export the copied DB as a BACPAC (as ramiramilu pointed out): http://msdn.microsoft.com/en-us/library/windowsazure/jj650016.aspx
You can do it yourself or use RedGate SQL Azure Backup, but from what I understand they follow exactly the same steps as described above, so if you choose their consistent backup option it's going to cost you as well.
As per the answer from Slav, the BACPAC is non-transactional and can end up corrupted if new rows are added to any table while the BACPAC is being generated.
To avoid this:
1) Copy the target database; the statement returns straight away, but the database will take some time to copy. This operation creates a full transactional copy:
CREATE DATABASE <name> AS COPY OF <original_name>
2) Find the status of your copy operation:
SELECT * FROM sys.dm_database_copies
3) Generate a bacpac file on the copied database, which isn't being used by anyone.
4) Delete the copied database, and you'll have a working bacpac file.
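Putting the T-SQL side of those steps together (the database names are placeholders; the export itself is done with the Import/Export service or SqlPackage, whichever you normally use):

-- 1) Kick off a transactionally consistent copy (the statement returns immediately):
CREATE DATABASE MyDb_copy AS COPY OF MyDb;

-- 2) Poll until the copy is finished (run against master; the copy shows as COPYING, then ONLINE):
SELECT name, state_desc FROM sys.databases WHERE name = 'MyDb_copy';

-- 3) Export MyDb_copy as a BACPAC with your usual tooling, then drop the copy:
DROP DATABASE MyDb_copy;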
To allow more realistic conditions during development and testing, we want to automate a process to copy our SQL Server 2008 databases down from production to developer workstations. Because these databases range in size from several GB up to 1-2 TB, it will take forever and not fit onto some machines (I'm talking to you, SSDs). I want to be able to press a button or run a script that can clone a database - structure and data - except be able to specify WHERE clauses during the data copy to reduce the size of the database.
I've found several partial solutions but nothing that is able to copy schema objects and a custom restricted data without requiring lots of manual labor to ensure objects/data are copied in correct order to satisfy dependencies, FK constraints, etc. I fully expect to write the WHERE clause for each table manually, but am hoping the rest can be automated so we can use this easily, quickly, and frequently. Bonus points if it automatically picks up new database objects as they are added.
Any help is greatly appreciated.
Snapshot replication with conditions on tables. That way you will get your schema and data replicated whenever needed.
This article describes how to create a merge replication, but when you choose snapshot replication the steps are the same. The most interesting part is Step 8: Filter Table Rows, because with it you can filter out all the data you don't need to replicate. This step has to be done for every entity, though, and if you've got hundreds of them you'd better look into doing it programmatically instead of going through the wizard windows.
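If the wizard becomes impractical with hundreds of tables, the same row filters can be scripted, roughly along these lines (the names are placeholders, and the exact sp_addarticle parameters are worth double-checking against the documentation; as far as I recall, @filter_clause is the one that carries the row filter):

-- Sketch only: add one filtered article to an already created publication.
EXEC sp_addarticle
    @publication   = N'DevSubsetPub',
    @article       = N'Orders',
    @source_owner  = N'dbo',
    @source_object = N'Orders',
    @filter_clause = N'OrderDate >= DATEADD(MONTH, -3, GETDATE())';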
Based on reading around the web, Stack Overflow, and mostly these articles about DB versioning that were linked from Coding Horror, I've made a stab at writing a plan for versioning the database of an 8-year-old PHP/MySQL website.
Database Version Control plan
- Create a db as the "Master Database"
- Create a table db_version (id, script_name, version_number, author, comment, date_ran)
- Create a baseline script for the schema + core data that creates this db from scratch; run this on the Master Db
- Create a "test data" script to load any db with working data
- Modifications to the Master Db are ONLY to be made through the db versioning process
- Ensure everyone developing against the Master Db has a local db created by the baseline script
- Procedures for committing and updating from the Master Db
  - Master Db Commit
    - Perform a schema diff between your local db and the Master Db
    - Perform a data diff on core data between your local db and the Master Db
    - If there are changes in either or both cases, combine these changes into an update script
    - Collect the data to be added as a new row in the db_version table, and add an insert for it into the script
      - new version number = latest Master Db version number + 1
      - author
      - comment
    - The script must be named changeScript_V.sql where V is the latest Master Db version + 1
    - Run the script against the Master Db
    - If the script executed successfully, add it to the svn repository
    - Add the new db_version record to your local db_version table
  - Update from Master Db
    - Update your local svn checkout to have all the latest change scripts available
    - Compare your local db_version table to the Master Db's db_version table to determine which change scripts to run
    - Run the required change scripts in order against your local db, which will also update your local db_version table
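To make the plan concrete, the db_version table and an example change script would look something like this (MySQL syntax; the users.last_login change is just a made-up example):

-- db_version table from the plan:
CREATE TABLE db_version (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    script_name    VARCHAR(255) NOT NULL,
    version_number INT NOT NULL,
    author         VARCHAR(100) NOT NULL,
    `comment`      TEXT,
    date_ran       DATETIME NOT NULL
);

-- changeScript_2.sql (example): the schema change plus its own version row.
ALTER TABLE users ADD COLUMN last_login DATETIME NULL;

INSERT INTO db_version (script_name, version_number, author, `comment`, date_ran)
VALUES ('changeScript_2.sql', 2, 'jane', 'Add users.last_login', NOW());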
My first question is, does this sound correct?
My second question is: the commit process seems a bit complicated to do more than once a day. Is there a way to reliably automate it? Or should I not be committing database changes often enough for it to matter?
Looking at your proposal, it doesn't seem like something that's feasible or practical.
I used to work at a company where we had more than 1k tables per database (a very complex system), and it all worked fine like this:
Have one person in charge of the DB (let's call him DBPerson) - every script/DB change has to pass through him. This avoids unnecessary changes and overlooked side effects (for example, if someone changes an index to make his own query perform better, he might destroy another person's work; someone might create a table that is completely redundant and unnecessary; etc.). This keeps the DB clean and efficient. Even if it seems like too much work for one person (or his deputy), in practice it isn't - the DB usually changes rarely.
Each script has to pass validation by DBPerson.
When a script is approved, DBPerson assigns it a number and puts it in an 'update' folder/svn(...), with appropriate numbering (as you suggested, incremental numbers for example).
Next, if you have continuous integration in place, the script gets picked up and updates the DB (if you don't have continuous integration, do it manually).
Do not store an entire database script with all the data in it. Store the actual database instead. If you have branches of the solution, give each branch its own database, or at least keep the update scripts divided per branch so you can roll back/forward to another branch. But I really recommend a separate DB for each branch.
Have one database that always contains the default data (intact) - for unit tests, regression tests, etc. Whenever you run the tests, run them on a copy of this database. You could even have a nightly refresh of the test databases from the main one (if appropriate, of course).
In an environment like this you'll have multiple versions of database:
Developer's database (local) - the one the dev is using to test his work. He can always copy from Master or Test Master.
Master database - the one with all the default values, maybe semi-empty if you're doing redeploys to new clients.
Test Master database - the Master database filled with test data. Any scripts you have run on Master, you run here as well.
Test in progress database - copied from Test Master and used for testing - gets overwritten prior to any new test.
If you have branches (a similar database with slight differences for each client), then you'll have the same as above for each branch...
You will most certainly have to adapt this to your situation, but in any case I think that keeping a textual version of the create script for the entire database is wrong in terms of maintainability, merging, updating, etc.
I have two different databases.
One of them is the original database and the other one is a cache database.
These databases are in different locations.
Once a day, I must update the cache database from the original database.
And I must do this update through a Web Service running on the original database's machine.
I could do it by clearing all the cache DB tables and inserting the original data each time.
But I think that is a bad approach.
So how can I do this update efficiently?
Do you have any suggestions?
I'm pretty sure that there are DB syncing technologies out there, but given your requirements I'd recommend using a change log.
So, you'll have a CHANGE_LOG table, to which you insert rows whenever you write to your tables (INSERT, UPDATE, DELETE). Once a day, you apply these changes one by one to the cache DB.
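A minimal sketch of such a change log (SQL Server syntax shown as an example; all names are illustrative, and the insert would be done by the same code path, or a trigger, that performs the write):

-- Change log on the original database (illustrative schema):
CREATE TABLE CHANGE_LOG (
    change_id   BIGINT IDENTITY(1,1) PRIMARY KEY,  -- or AUTO_INCREMENT, depending on the engine
    table_name  VARCHAR(128) NOT NULL,
    row_pk      VARCHAR(128) NOT NULL,
    operation   CHAR(1)      NOT NULL,             -- 'I', 'U' or 'D'
    changed_at  DATETIME     NOT NULL
);

-- Example entry written alongside an UPDATE of product 42:
INSERT INTO CHANGE_LOG (table_name, row_pk, operation, changed_at)
VALUES ('Products', '42', 'U', CURRENT_TIMESTAMP);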
Deleting the change log once it's applied is okay, but you can also give the DBs a "version": each change to the DB increments the version number, which can then be used to manage more than one cache DB.
To provide additional assurance, for example, you can have a trigger in the cache DB that increments its own version number. That way, your process can query a cache DB and know which changes still need to be applied, without tracking that in the master DB (which also makes it easy to hook up a new cache DB or bring a crashed cache DB up to date).
Note that you probably need to purge the change log from time to time.
The way I see it, you're going to have to grab all the data from the source database, since you don't seem to have any way of interrogating it to see what data has changed. A simple way to do it would be to copy all the data from the source database into temporary or staging tables in the cache database. Then you can do a diff between the two sets of tables and update the records that have changed. Or, once you have all the data in the staging tables, drop/rename the existing tables and rename the staging tables to the existing table names.
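A sketch of the second variant (load into staging, then swap), in SQL Server syntax with made-up names:

-- Load a fresh copy of the source table into a staging table (note: indexes/constraints are not copied):
SELECT *
INTO dbo.Products_staging
FROM SOURCESRV.SourceDb.dbo.Products;   -- SOURCESRV is an illustrative linked server name

-- Swap the staging table in, keeping the old copy around until the swap succeeds:
BEGIN TRANSACTION;
EXEC sp_rename 'dbo.Products', 'Products_old';
EXEC sp_rename 'dbo.Products_staging', 'Products';
COMMIT;

DROP TABLE dbo.Products_old;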