I'm looking for a way to get a diff of two states (S1, S2) in a database (Oracle), to compare and see what has changed between these two states. Best would be to see what statements I would have to apply to the database in state one (S1) to transform it to state two (S2).
The two states are from the same database (schema) at different points in time (some small amount of time, not weeks).
I was thinking about doing something like a snapshot and compare - but how to make the snapshots and how to compare them in the best way ?
Edit: I'm looking for changes in the data (primarily) and if possible objects.
This is one of those questions which are easy to state, and it seems the solution should be equally simple. Alas it is not.
The starting point is the data dictionary. From ALL_TABLES you can generate a set of statements like this:
select * from t1#dbstate2
minus
select * from t1#dbstate1
This will give you the set of rows that have been added or amended in dbstate2. You also need:
select * from t1#dbstate1
minus
select * from t1#dbstate2
This will give you the set of rows that have been deleted or amended in dbstate2. Obviously the amended ones will be included in the first set, it's the delta you need, which gives the deleted rows.
Except it's not that simple because:
When a table has a surrogate primary key (populated by a sequence)
then the primary key for the same record might have a different value
in each database. So you should exclude such primary keys from the
sets, which means you need to generated tailored projections for each
table using ALL_TAB_COLS and ALL_CONSTRAINTS, and you may have to use
your skill and judgement to figure out which queries need to exclude
the primary key.
Also, resolving foreign keys is problematic. If the foreign key is a
surrogate key (or even if it isn't) you need to look up the
referenced table to compare the meaning / description columns in the
two databases. But of course, the reference data could have different
state in the two databases, so you have to resolve that first.
Once you have a set of queries which identify the difference you are
ready for the next stage: generating the appliance statements. There
are two choices here: generating a set of INSERT, UPDATE and DELETE
statements or generating a set of MERGE statements. MERGE has the
advantage of idempotency but is a gnarly thing to generate. Probably
go for the easier option.
Remember:
For INSERT and UPDATE statements exclude columns which are populated by triggers or are generated (identity, virtual columns).
For INSERT and UPDATE statements you will need to join to referenced tables for populating foreign keys on the basis of description columns (unless you have already synchronised the primary key columns of all foreign key tables).
So this means you need to apply changes in the order dictated by foreign key dependencies.
For DELETE statements you need to cascade foreign key deletions.
You may consider dropping foreign keys and maybe other constraints, but then you may be in a right pickle when you come to re-apply them only to discover you have you have constraint violations.
Use DML Error Logging to track errors in bulk operations. Find out more.
If you need to manage change of schema objects too? Oh boy. You need to align the data structures first before you can even start doing the data comparison task. This is simpler than the contents, because it just requires interrogating the data dictionary and generating DDL statements. Even so, you need to run minus queries on ALL_TABLES (perhaps even ALL_OBJECTS) to see whether there are tables added to or dropped from the target database. For tables which are present in both you need to query ALL_TAB_COLS to verify the columns - names, datatype, length and precision, and probably mandatory too.
Just synchronising schema structures is sufficiently complex that Oracle sell the capability as a chargeable extra to the Enterprise Edition license, the Change Management Pack.
So, to confess. The above is a thought experiment. I have never done this. I doubt whether anybody ever has done this. For all but the most trivial of schemas generating DML to synchronise state is a monstrous exercise, which could take months to deliver (during which time the states of the two databases continue to diverge).
The straightforward solution? For a one-off exercise, Data Pump Export from S2, Data Pump Import into S1 using the table_exists_action=REPLACE option. Find out more.
For ongoing data synchronisation Oracle offers a variety of replication solutions. Their recommended approach is GoldenGate but that's a separately licensed product so of course they recommend it :) Replication with Streams is deprecated in 12c but it's still there. Find out more.
The solution for synchronising schema structure is simply not to need it: store all the DDL scripts in a source control repository and always deploy from there.
Related
The two databases have identical schemas, but distinct data. It's possible there will be some duplication of rows, but it's sufficient for the merge to bail noisily and not do the update if duplicates are found, i.e., duplicates should be resolved manually.
Part of the problem is that there are a number of foreign key constraints in the databases in question. Also, there may be some columns which reference foreign keys which do not actually have foreign key constraints. These latter are due to performance issues on insertion. Also, we need to be able to map between the ids from the old databases and the IDs in the new database.
Obviously, we can write a bunch of code to handle this, but we are looking for a solution which is:
Less work
Less overhead on the machines doing the merge.
More reliable. If we have to write code it will need to go through testing, etc. and isn't guaranteed to be bug free
Obviously we are still searching the web and the Postgresql documentation for the answer, but what we've found so far has been unhelpful.
Update: One thing I clearly left out is that "duplicates" are clearly defined by unique constraints in the schema. We expect to restore the contents of one database, then restore the contents of a second. Errors during the second restore should be considered fatal to the second restore. The duplicates should then be removed from the second database and a new dump created. We want the IDs to be renumbered, but not the other unique constraints. It's possible, BTW, that there will be a third or even a fourth database to merge after the second.
There's no shortcut to writing a bunch of scripts… This cannot realistically be automated, since managing conflicts requires applying rules that will be specific to your data.
That said, you can reduce the odds of conflicts by removing duplicate surrogate keys…
Say your two databases have only two tables: A (id pkey) and B (id pkey, a_id references A(id)). In the first database, find max_a_id = max(A.id) and max_b_id = max(B.id).
In the second database:
Alter table B if needed so that a_id does cascade updates.
Disable triggers if any have side effects that might erroneously kick in.
Update A and set id = id + max_a_id, and the same kind of thing for B.
Export the data
Next, import this data into the first database, and update sequences accordingly.
You'll still need to be wary of overflows if IDs can end up larger than 2.3 billion, and of unique keys that might exist in both databases. But at least you won't need to worry about dup IDs.
This is the sort of case I'd be looking into ETL tools like CloverETL, Pentaho Kettle or Talend Studio for.
I tend to agree with Denis that there aren't any real shortcuts to avoid dealing with the complexity of a data merge.
I am working in a postgresQL database that has 200+ plus tables, some with many (15+) constraints that are references to other tables. I was able to list all of the constraints from the pg_constraints tables.
I am trying to map the dependencies and tables referenced by each table in order to be able to manage writing to the tables from a web application. If a table has a dependency, I need to be sure the dependent tables are written to before that table, and if the table is referenced by another table, that the given table has the necessary rows before the table that references it is queried. How can I get a list of the tables in the order they would need to be written to,and in reverse, the order that would need to be followed to delete from multiple tables?
Two pieces of advice:
There can be circular dependencies, especially with triggers and rules. In a case like yours not unlikely. That's perfectly acceptable design, programming logic in triggers / rules / foreign keys etc. has to take care that you don't enter infinite loops. (The RDBMS can detect them when they happen.) But your quest to map out dependencies on a relational level may lead you in circles.
pgAdmin (current version 1.14) has dedicated tabs for "Dependencies" and "Dependents" in the main interface. Click on any object and get a full list. This could be rather helpful in your case - may even be all you need.
You can also turn on statement-logging of your database and read how pgAdmin retrieves that information in your logs. Set log_statement = all for that purpose.
Don't forget to reset log_statement, or your logfiles will get huge.
You can also try to change constraints to deferred.
I have a routine that will be creating individual tables (Sql Server 2008) to store the results of reports generated by my application (Asp.net 3.5). Each report will need its own table, as the columns for the table would vary based on the report settings. A table will contain somewhere between 10-5,000 rows, rarely more than 10,000.
The following usage rules will apply:
Once stored, the data will never be updated.
Whenever results for the table are accessed, all data will be retrieved.
No other table will need to perform a join with this table.
Knowing this, is there any reason to create a PK index column on the table? Will doing so aid the performance of retrieving the data in any way, and if it would, would this outweigh the extra load of updating the index when inserting data (I know that 10K records is a relatively small amount, but this solution needs to be able to scale).
Update: Here are some more details on the data being processed, which goes into the current design decision of one table per report:
Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
Whenever data is retrieved, it some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).
I would also be suspicious of someone claiming that they had to create a new table for each time the report was run. However, given that different columns (both in number, name and datatype) could conceivably be needed for every time the report was run, I don't see a great alternative.
The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), ReferenceValues column (varchar field, containing all Reference values, in a specified order, separated by some delimiter) and NumericValues column (same as ReferenceValues, but for the numbers), and then when I retrieve the results, put everything into specialized objects in the system, separating the values based on the defined delimiter). Does this seem preferable?
Primary keys are not a MUST for any and all data tables. True, they are usually quite useful and to abandon them is unwise. However, in addition to a primary missions of speed (which I agree would doubtfully be positively affected) is also that of uniqueness. To that end, and valuing the consideration you've already obviously taken, I would suggest that the only need for a primary key would be to govern the expected uniqueness of the table.
Update:
You mentioned in a comment that if you did a PK that it would include an Identity column that presently does not exist and is not needed. In this case, I would advise against the PK altogether. As #RedFilter pointed out, surrogate keys never add any value.
I would keep it simple, just store the report results converted to json or xml, in a VARCHAR(MAX) column
One of the most useful and least emphasized (explicitly) benefits of data integrity (primary keys and foreign key references to start with) is that it forces a 'design by contract' between your data and your application(s); which stops quite a lot of types of bugs from doing any damage to your data. This is such a huge win and a thing that is implicitly taken for granted (it is not 'the database' that protects it, but the integrity rules you specify; forsaking the rules you expose your data to various levels of degradation).
This seems unimportant to you (from the fact that you did not even discuss what would be a possible primary key) and your data seems quite unrelated to other parts of the system (from the fact that you will not do joins to any other tables); but still - if all things are equal I would model the data properly and then if primary keys (or other data integrity rules) are not used and if chasing every last bit of performance I would consider dropping them in production (and test for any actual gains).
As for comments that creating tables is a performance hit - that is true, but you did not tell us how temporary are these tables? Once created will they be heavily used before scrapped? Or do you plan to create tables for just dozen of read operations.
In case you will heavily use these tables and if you will provide clean mechanism for managing them (removing them when not used, selecting them, etc...) I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; use case would be nice)
Notes on other solutions:
EAV model
is horrible unless very specific conditions are met (for example: flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries will you have to deal with and rigorous in validating data on the front end).
XML/BLOB approach
might be the right thing for you if you will consume the data as XML/BLOBs at presentation layer (always read all of the rows, always write the whole 'object' and finally, if your presentation layer likes XML/BLOBS)
EDIT:
Also, depending on the usage patterns, having primary key can indeed increase the speed of retrieval, and if I can read the fact that the data will not be updated as 'it will be written once and read many times' then there is a good chance that it will indeed overweight the cost of updating the index on inserts.
will it 1 table for every run of a given report, or one table to all runs of a given report? in other words, if you have Report #1 and you run it 5 times, over a different range of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?
If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. in this case, having a primary key will let you do the where statement for the filter, much faster.
if you are creating a new table for every run of the report, then you don't need a primary key. however, you are going to run into to other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data / tables.
If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.
Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)
If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.
By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.
I have the same database running on two different machines. The DB's make extensive use of Identity columns, and the tables have clashed pretty horribly. I now want to merge these two together before sorting out the undelying issue which I may do by
A) Using GUIDs (unweildy but works everywhere)
B) Assigning Identity ranges, kind of naff, but means you can still access records in order, knock up basic Sql and select records easily, and it identifies which machine originated the data.
My question is, what's the best way of re-keying (ie changing the primary keys) on one of the databases so the data no longer clashes. We're only looking at 6 tables total, but lots of rows ~2M in the 3 tables.
Update - is there any real sql code out there that does this, I know about Identity Insert etc. I've solved this issue in a number of in-elegant ways before, and I was looking for the elegant solution, preferable with a nice TSQL SP to do the donkey work - if that doesn't exist I'll code it up and place on wiki.
A simplistic way is to change all keys on the one of the databases by a fixed increment, say 10,000,000, and they will line up. In order to do this, you will have to bring the applications down so the database is quiet and drop all FK references affected by this, recreating them when finished. You will also have to reset the seed value on all affected identity columns to an appropriate value.
Some of the tables will be reference data, which will be more complicated to merge if it is not in sync. You could possibly have issues with conflicting codes meaning the same thing on different instances or the same code having different meanings. This may or may not be an issue with your application but if the instances have been run without having this coordinated between them you might want to check carefully for this.
Also, data like names and addresses are very likely to be out of sync if there wasn't a canonical source for these. You may need to get these out, run a matching query and get the business to tidy up any exceptions.
I would add another column to the table first, populate that with the new Primary key.
Then I'd use update statements to update the new foreign key fields in all related tables.
Then you can drop the old Primary key and old foreign key fields.
I'm currently working on someone else's database where the primary keys are generated via a lookup table which contains a list of table names and the last primary key used. A stored procedure increments this value and checks it is unique before returning it to the calling 'insert' SP.
What are the benefits for using a method like this (or just generating a GUID) instead of just using the Identity/Auto-number?
I'm not talking about primary keys that actually 'mean' something like ISBNs or product codes, just the unique identifiers.
Thanks.
An auto generated ID can cause problems in situations where you are using replication (as I'm sure the techniques you've found can!). In these cases, I generally opt for a GUID.
If you are not likely to use replication, then an auto-incrementing PK will most likely work just fine.
There's nothing inherently wrong with using AutoNumber, but there are a few reasons not to do it. Still, rolling your own solution isn't the best idea, as dacracot mentioned. Let me explain.
The first reason not to use AutoNumber on each table is you may end up merging records from multiple tables. Say you have a Sales Order table and some other kind of order table, and you decide to pull out some common data and use multiple table inheritance. It's nice to have primary keys that are globally unique. This is similar to what bobwienholt said about merging databases, but it can happen within a database.
Second, other databases don't use this paradigm, and other paradigms such as Oracle's sequences are way better. Fortunately, it's possible to mimic Oracle sequences using SQL Server. One way to do this is to create a single AutoNumber table for your entire database, called MainSequence, or whatever. No other table in the database will use autonumber, but anyone that needs a primary key generated automatically will use MainSequence to get it. This way, you get all of the built in performance, locking, thread-safety, etc. that dacracot was talking about without having to build it yourself.
Another option is using GUIDs for primary keys, but I don't recommend that because even if you are sure a human (even a developer) is never going to read them, someone probably will, and it's hard. And more importantly, things implicitly cast to ints very easily in T-SQL but can have a lot of trouble implicitly casting to a GUID. Basically, they are inconvenient.
In building a new system, I'd recommend using a dedicated table for primary key generation (just like Oracle sequences). For an existing database, I wouldn't go out of my way to change it.
from CodingHorror:
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
The article provides a lot of good external links on making the decision on GUID vs. Auto Increment. If I can, I go with GUID.
It's useful for clients to be able to pre-allocate a whole bunch of IDs to do a bulk insert without having to then update their local objects with the inserted IDs. Then there's the whole replication issue, as mentioned by Galwegian.
The procedure method of incrementing must be thread safe. If not, you may not get unique numbers. Also, it must be fast, otherwise it will be an application bottleneck. The built in functions have already taken these two factors into account.
My main issue with auto-incrementing keys is that they lack any meaning
That's a requirement of a primary key, in my mind -- to have no other reason to exist other than identifying a record. If it has no real-world meaning, then it has no real-world reason to change. You don't want primary keys to change, generally speaking, because you have to search-replace your whole database or worse. I have been surprised at the sorts of things I have assumed would be unique and unchanging that have not turned out to be years later.
Here's the thing with auto incrementing integers as keys:
You HAVE to have posted the record before you get access to it. That means that until you have posted the record, you cannot, for example, prepare related records that will be stored in another table, or any one of a lot of other possible reasons why it might be useful to have access to the new record's unique id, before posting it.
The above is my deciding factor, whether to go with one method, or the other.
Using a unique identifiers would allow you to merge data from two different databases.
Maybe you have an application that collects data in multiple database and then "syncs" with a master database at various times in the day. You wouldn't have to worry about primary key collisions in this scenario.
Or, possibly, you might want to know what a record's ID will be before you actually create it.
One benefit is that it can allow the database/SQL to be more cross-platform. The SQL can be exactly the same on SQL Server, Oracle, etc...
The only reason I can think of is that the code was written before sequences were invented and the code forgot to catch up ;)
I would prefer to use a GUID for most of the scenarios in which the post's current method makes any sense to me (replication being a possible one). If replication was the issue, such a stored procedure would have to be aware of the other server which would have to be linked to ensure key uniqueness, which would make it very brittle and probably a poor way of doing this.
One situation where I use integer primary keys that are NOT auto-incrementing identities is the case of rarely-changed lookup tables that enforce foreign key constraints, that will have a corresponding enum in the data-consuming application. In that scenario, I want to ensure the enum mapping will be correct between development and deployment, especially if there will be multiple prod servers.
Another potential reason is that you deliberately want random keys. This can be desirable if, say, you don't want nosey browsers leafing through every item you have in the database, but it's not critical enough to warrant actual authentication security measures.
My main issue with auto-incrementing keys is that they lack any meaning.
For tables where certain fields provide uniqueness (whether alone or in combination with another), I'd opt for using that instead.
A useful side benefit of using a GUID primary key instead of an auto-incrementing one is that you can assign the PK value for a new row on the client side (in fact you have to do this in a replication scenario), sparing you the hassle of retrieving the PK of the row you just added on the server.
One of the downsides of a GUID PK is that joins on a GUID field are slower (unless this has changed recently). Another upside of using GUIDs is that it's fun to try and explain to a non-technical manager why a GUID collision is rather unlikely.
Galwegian's answer is not necessarily true.
With MySQL you can set a key offset for each database instance. If you combine this with a large enough increment it will for fine. I'm sure other vendors would have some sort of similar settings.
Lets say we have 2 databases we want to replicate. We can set it up in the following way.
increment = 2
db1 - offset = 1
db2 - offset = 2
This means that
db1 will have keys 1, 3, 5, 7....
db2 will have keys 2, 4, 6, 8....
Therefore we will not have key clashes on inserts.
The only real reason to do this is to be database agnostic (if different db versions use different auto-numbering techniques).
The other issue mentioned here is the ability to create records in multiple places (like in the central office as well as on traveling users' laptops). In that case, though, you would probably need something like a "sitecode" that was unique to each install that was prefixed to each ID.