I am working in a PostgreSQL database that has 200+ tables, some with many (15+) constraints that reference other tables. I was able to list all of the constraints from the pg_constraint catalog.
I am trying to map the dependencies and tables referenced by each table in order to be able to manage writing to the tables from a web application. If a table has a dependency, I need to be sure the dependent tables are written to before that table, and if the table is referenced by another table, that the given table has the necessary rows before the table that references it is queried. How can I get a list of the tables in the order they would need to be written to, and in reverse, the order that would need to be followed to delete from multiple tables?
Two pieces of advice:
There can be circular dependencies, especially with triggers and rules - not unlikely in a case like yours. That's perfectly acceptable design; the programming logic in triggers / rules / foreign keys etc. has to take care that you don't enter infinite loops. (The RDBMS can detect them when they happen.) But your quest to map out dependencies on a relational level may lead you in circles.
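If you still want to derive a write order from the foreign keys alone, a recursive query over pg_constraint is a starting point. This is purely a sketch and assumes all your tables live in the public schema; it assigns each table a level such that writing in ascending level order satisfies plain FK dependencies (reverse it for deletes). Be warned: tables caught in a foreign key cycle either drop out of the result or make the query recurse forever, which is exactly the point above.
WITH RECURSIVE fk AS (
    SELECT conrelid::regclass  AS child,
           confrelid::regclass AS parent
    FROM   pg_constraint
    WHERE  contype = 'f'
),
levels AS (
    -- tables that reference nothing can be written first
    SELECT c.oid::regclass AS tbl, 0 AS lvl
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.relkind = 'r'
    AND    n.nspname = 'public'
    AND    NOT EXISTS (SELECT 1 FROM fk WHERE fk.child = c.oid::regclass)
    UNION ALL
    -- a child must come after every parent it references
    SELECT fk.child, l.lvl + 1
    FROM   fk
    JOIN   levels l ON l.tbl = fk.parent
)
SELECT tbl::text AS table_name, max(lvl) AS write_order
FROM   levels
GROUP  BY tbl::text
ORDER  BY write_order, table_name;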
pgAdmin (current version 1.14) has dedicated tabs for "Dependencies" and "Dependents" in the main interface. Click on any object and get a full list. This could be rather helpful in your case - may even be all you need.
You can also turn on statement-logging of your database and read how pgAdmin retrieves that information in your logs. Set log_statement = all for that purpose.
Don't forget to reset log_statement, or your logfiles will get huge.
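For example (mydb being a stand-in for your database name; superuser rights required):
ALTER DATABASE mydb SET log_statement = 'all';   -- new sessions on mydb now log every statement
-- ... click through pgAdmin, then read the server log ...
ALTER DATABASE mydb RESET log_statement;         -- back to the default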
You can also try changing constraints to DEFERRABLE, so that foreign key checks are deferred until the end of the transaction.
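A constraint has to be created as DEFERRABLE before it can be deferred inside a transaction. A minimal sketch with made-up table and constraint names:
ALTER TABLE child_table
    DROP CONSTRAINT child_parent_fkey,
    ADD  CONSTRAINT child_parent_fkey
         FOREIGN KEY (parent_id) REFERENCES parent_table (id)
         DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS ALL DEFERRED;   -- FK checks now wait until COMMIT
-- write to the tables in whatever order is convenient
COMMIT;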
I'm looking for a way to get a diff of two states (S1, S2) in a database (Oracle), to compare and see what has changed between these two states. Best would be to see what statements I would have to apply to the database in state one (S1) to transform it to state two (S2).
The two states are from the same database (schema) at different points in time (some small amount of time, not weeks).
I was thinking about doing something like a snapshot and compare - but how best to make the snapshots, and how to compare them?
Edit: I'm looking for changes in the data (primarily) and if possible objects.
This is one of those questions which are easy to state, and it seems the solution should be equally simple. Alas it is not.
The starting point is the data dictionary. From ALL_TABLES you can generate a set of statements like this:
select * from t1@dbstate2
minus
select * from t1@dbstate1
This will give you the set of rows that have been added or amended in dbstate2. You also need:
select * from t1@dbstate1
minus
select * from t1@dbstate2
This will give you the set of rows that have been deleted or amended in dbstate2. Obviously the amended rows will also appear (in their new form) in the first set; it's the delta between the two sets that identifies the genuinely deleted rows.
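Generating those statements is itself just a query against the dictionary. A rough sketch, assuming dbstate1 and dbstate2 are database links (or whatever mechanism you use to reach the two states) and MYSCHEMA is the owner:
SELECT 'select * from ' || table_name || '@dbstate2'
       || ' minus select * from ' || table_name || '@dbstate1;' AS diff_stmt
FROM   all_tables
WHERE  owner = 'MYSCHEMA'
ORDER  BY table_name;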
Except it's not that simple because:
When a table has a surrogate primary key (populated by a sequence), the primary key for the same record might have a different value in each database. So you should exclude such primary keys from the sets, which means you need to generate tailored projections for each table using ALL_TAB_COLS and ALL_CONSTRAINTS, and you may have to use your skill and judgement to figure out which queries need to exclude the primary key.
Also, resolving foreign keys is problematic. If the foreign key is a surrogate key (or even if it isn't) you need to look up the referenced table to compare the meaning / description columns in the two databases. But of course, the reference data could have a different state in the two databases, so you have to resolve that first.
Once you have a set of queries which identify the differences, you are ready for the next stage: generating the statements that apply them. There are two choices here: generating a set of INSERT, UPDATE and DELETE statements, or generating a set of MERGE statements. MERGE has the advantage of idempotency but is a gnarly thing to generate, so you will probably go for the easier option.
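For what it's worth, a hand-written MERGE for one trivial table might look like the following (t1, id and col1 are placeholders; a generated version would have to build the column lists from ALL_TAB_COLS, and deletions still need separate handling):
MERGE INTO t1 tgt
USING (SELECT id, col1 FROM t1@dbstate2) src
ON    (tgt.id = src.id)
WHEN MATCHED THEN
    UPDATE SET tgt.col1 = src.col1
WHEN NOT MATCHED THEN
    INSERT (id, col1) VALUES (src.id, src.col1);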
Remember:
For INSERT and UPDATE statements exclude columns which are populated by triggers or are generated (identity, virtual columns).
For INSERT and UPDATE statements you will need to join to referenced tables to populate foreign keys on the basis of description columns (unless you have already synchronised the primary key columns of all foreign key tables) - see the sketch below.
So this means you need to apply changes in the order dictated by foreign key dependencies.
For DELETE statements you need to cascade foreign key deletions.
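As an illustration of that foreign key resolution, with hypothetical ORDERS / CUSTOMERS tables and CUSTOMER_NAME as the description column used to match rows across the two states:
INSERT INTO orders (order_ref, order_date, customer_id)
SELECT so.order_ref,
       so.order_date,
       lc.customer_id                -- local surrogate key, looked up via the description column
FROM   orders@dbstate2    so
JOIN   customers@dbstate2 sc ON sc.customer_id   = so.customer_id
JOIN   customers          lc ON lc.customer_name = sc.customer_name;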
You may consider dropping foreign keys and maybe other constraints, but then you may be in a right pickle when you come to re-apply them only to discover you have constraint violations.
Use DML Error Logging to track errors in bulk operations. Find out more.
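In Oracle that looks something like this (T1 again standing in for a real table; DBMS_ERRLOG creates the ERR$_ logging table for you):
BEGIN
    DBMS_ERRLOG.create_error_log('T1');   -- creates ERR$_T1
END;
/

INSERT INTO t1
SELECT * FROM t1@dbstate2
LOG ERRORS INTO err$_t1 ('sync run 1') REJECT LIMIT UNLIMITED;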
What if you need to manage changes to schema objects too? Oh boy. You need to align the data structures before you can even start on the data comparison task. This is simpler than comparing the contents, because it just requires interrogating the data dictionary and generating DDL statements. Even so, you need to run MINUS queries on ALL_TABLES (perhaps even ALL_OBJECTS) to see whether tables have been added to or dropped from the target database. For tables which are present in both you need to query ALL_TAB_COLS to verify the columns - names, datatypes, lengths and precision, and probably whether they are mandatory too.
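A sketch of the column-level comparison, again assuming database links named dbstate1 / dbstate2 and a schema called MYSCHEMA (run it in both directions to catch columns dropped as well as added or changed):
SELECT table_name, column_name, data_type, data_length, data_precision, data_scale, nullable
FROM   all_tab_cols@dbstate2
WHERE  owner = 'MYSCHEMA'
MINUS
SELECT table_name, column_name, data_type, data_length, data_precision, data_scale, nullable
FROM   all_tab_cols@dbstate1
WHERE  owner = 'MYSCHEMA';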
Just synchronising schema structures is sufficiently complex that Oracle sell the capability as a chargeable extra to the Enterprise Edition license, the Change Management Pack.
So, to confess. The above is a thought experiment. I have never done this. I doubt whether anybody ever has done this. For all but the most trivial of schemas generating DML to synchronise state is a monstrous exercise, which could take months to deliver (during which time the states of the two databases continue to diverge).
The straightforward solution? For a one-off exercise, Data Pump Export from S2, Data Pump Import into S1 using the table_exists_action=REPLACE option. Find out more.
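Roughly, with placeholder connection strings, schema and file names:
expdp system@s2 schemas=MYSCHEMA directory=DATA_PUMP_DIR dumpfile=s2.dmp logfile=s2_exp.log

impdp system@s1 schemas=MYSCHEMA directory=DATA_PUMP_DIR dumpfile=s2.dmp logfile=s2_imp.log table_exists_action=replace
(The dump file has to be visible to the target database's DATA_PUMP_DIR directory object, so copy it across first if the two databases are on different servers.)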
For ongoing data synchronisation Oracle offers a variety of replication solutions. Their recommended approach is GoldenGate but that's a separately licensed product so of course they recommend it :) Replication with Streams is deprecated in 12c but it's still there. Find out more.
The solution for synchronising schema structure is simply not to need it: store all the DDL scripts in a source control repository and always deploy from there.
The two databases have identical schemas, but distinct data. It's possible there will be some duplication of rows, but it's sufficient for the merge to bail noisily and not do the update if duplicates are found, i.e., duplicates should be resolved manually.
Part of the problem is that there are a number of foreign key constraints in the databases in question. Also, there are some columns which reference keys in other tables but do not actually have foreign key constraints; these were left unconstrained because of performance issues on insertion. Also, we need to be able to map between the IDs from the old databases and the IDs in the new database.
Obviously, we can write a bunch of code to handle this, but we are looking for a solution which is:
Less work
Less overhead on the machines doing the merge.
More reliable. If we have to write code it will need to go through testing, etc. and isn't guaranteed to be bug free
Obviously we are still searching the web and the PostgreSQL documentation for the answer, but what we've found so far has been unhelpful.
Update: One thing I clearly left out is that "duplicates" are clearly defined by unique constraints in the schema. We expect to restore the contents of one database, then restore the contents of a second. Errors during the second restore should be considered fatal to the second restore. The duplicates should then be removed from the second database and a new dump created. We want the IDs to be renumbered, but not the other unique constraints. It's possible, BTW, that there will be a third or even a fourth database to merge after the second.
There's no shortcut to writing a bunch of scripts… This cannot realistically be automated, since managing conflicts requires applying rules that will be specific to your data.
That said, you can reduce the odds of conflicts by removing duplicate surrogate keys…
Say your two databases have only two tables: A (id pkey) and B (id pkey, a_id references A(id)). In the first database, find max_a_id = max(A.id) and max_b_id = max(B.id).
In the second database:
Alter table B, if needed, so that the foreign key on a_id cascades updates (ON UPDATE CASCADE).
Disable triggers if any have side effects that might erroneously kick in.
Update A and set id = id + max_a_id, and the same kind of thing for B.
Export the data
Next, import this data into the first database, and update sequences accordingly.
You'll still need to be wary of overflows if IDs can end up larger than about 2.1 billion (the int4 maximum), and of unique keys that might exist in both databases. But at least you won't need to worry about duplicate IDs.
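A rough sketch of those steps in SQL, using the hypothetical A / B tables above (the constraint and sequence names are made up - check yours with \d in psql):
-- in the second database
BEGIN;

ALTER TABLE b DROP CONSTRAINT b_a_id_fkey;
ALTER TABLE b ADD  CONSTRAINT b_a_id_fkey
    FOREIGN KEY (a_id) REFERENCES a (id) ON UPDATE CASCADE;

-- pick offsets at least as large as max(id) in *both* databases,
-- so the renumbered ids cannot collide with anything
UPDATE a SET id = id + 1000000;   -- 1000000 standing in for max_a_id
UPDATE b SET id = id + 5000000;   -- 5000000 standing in for max_b_id

COMMIT;

-- after importing into the first database, push its sequences past the new maximums
SELECT setval('a_id_seq', (SELECT max(id) FROM a));
SELECT setval('b_id_seq', (SELECT max(id) FROM b));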
This is the sort of case where I'd be looking into ETL tools like CloverETL, Pentaho Kettle or Talend Studio.
I tend to agree with Denis that there aren't any real shortcuts to avoid dealing with the complexity of a data merge.
I have a scenario wherein users upload their custom preferences for the conversion being done by our utility. So, the problem we are facing is:
should we keep these custom mappings in one table, with an extra column for the user ID (or maybe a GUID), or
should we create tables on the fly for every user that registers with us, giving each user a separate table for their custom mappings altogether?
We're more inclined towards having just an extra column and one table for the custom mappings, but does the redundant data in that one column have any effect on performance? On the other hand, logic dictates that we normalize our tables and have separate tables. I'm rather confused as to what to do.
Common practice is to use a single table with a user id. If there are commonly reused preferences, you can instead refactor these out into a separate table and reference them in one go, but that is up to you.
Why is creating one table per user a bad idea? Well, I don't think you want one million tables in your database if you end up with one million users. Doing it this way will also guarantee your queries will be unwieldy at best, often slower than they ought to be, and often dynamically generated and EXECed - usually a last-resort option and not necessarily safe.
I inherited a system where the original developers had chosen to go with the one table per client approach. We are now rewriting the system! From a dev perspective, you would end up with a tangled mess of dynamic SQL and looping. The DBAs will not be happy because they will wake up to an entirely different database every morning, which makes performance tuning and maintenance impossible.
You could have an additional BLOB or XML column in the users table, storing the preferences. This is called the property bag approach, but I would not recommend this either, as it will hamper query performance in many cases.
The best approach in this scenario is to solve the problem with normalisation: a separate table for the properties, joined to the users table with a PK/FK relationship. I would not recommend using a GUID. The GUID would likely end up as the PK, meaning you would have a 16-byte clustered index key, which is also carried through to all non-clustered indexes on the table. You would probably also be building a non-clustered index on this column in the Users table. Also, you could run the risk of falling into the pitfall of generating these GUIDs with NEWID() rather than NEWSEQUENTIALID(), which will lead to massive fragmentation.
If you take the normalisation approach and you have concerns about returning the results across the two tables in a tabular format, then you can simply use a PIVOT operator.
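A minimal sketch of that design (all table and column names here are placeholders):
CREATE TABLE dbo.Users (
    UserId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UserName nvarchar(100)     NOT NULL
);

CREATE TABLE dbo.UserPreference (
    UserId    int            NOT NULL REFERENCES dbo.Users (UserId),
    PrefName  varchar(50)    NOT NULL,
    PrefValue nvarchar(400)  NOT NULL,
    CONSTRAINT PK_UserPreference PRIMARY KEY (UserId, PrefName)
);

-- one row per user, one column per preference
SELECT UserId, [theme], [language]
FROM (
    SELECT UserId, PrefName, PrefValue
    FROM   dbo.UserPreference
) AS src
PIVOT (MAX(PrefValue) FOR PrefName IN ([theme], [language])) AS pvt;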
What is the advantage of defining a foreign key when working with an MVC framework that handles the relation?
I'm using a relational database with a framework that allows model definitions with relations. Because the foreign keys are defined through the models, it seems like foreign keys are redundant. When it comes to managing the database of an application in development, editing/deleting tables that are using foreign keys is a hassle.
Is there any advantage to using foreign keys that I'm forgoing by dropping the use of them altogether?
Foreign keys with constraints (in some DB engines) give you data integrity at the lowest level - the level of the database.
It means you can't physically create a record that doesn't fulfill the relation.
It's just a way to be more safe.
It gives you data integrity that's enforced at the database level. This helps guard against possible errors in application logic that might cause invalid data.
If any data manipulation is ever done directly in SQL that bypasses your application logic, it also guards against bad data that breaks those constraints.
An additional side-benefit is that it allows tools to automatically generate database diagrams with relationships inferred from the schema itself. Now, in theory all the diagramming should be done before the database is created, but as the database evolves beyond its initial incarnation these diagrams often aren't kept up to date, and the ability to generate a diagram from an existing database is helpful both for review and for explaining the structure to new developers joining a project.
It might be helpful to disable FKs while the database structure is still in flux, but they're a good safeguard to have once the schema has stabilized.
A foreign key guarantees a matching record exists in a foreign table. Imagine a table called Books that has a FK constraint on a table called Authors. Every book is guaranteed to have an Author.
Now, you can do a query such as:
SELECT B.Title, A.Name FROM Books B
INNER JOIN Authors A ON B.AuthorId = A.AuthorId;
Without the FK constraint, a Book whose Author row is missing would be dropped from the join result, leaving books missing from your dataset.
Also, with the FK constraint, attempting to delete an author that was referred to by at least one Book would result in an error, rather than corrupting your database.
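For example, with the Books / Authors tables above (the constraint name is arbitrary):
ALTER TABLE Books
    ADD CONSTRAINT FK_Books_Authors
    FOREIGN KEY (AuthorId) REFERENCES Authors (AuthorId);

-- fails: there is no author with AuthorId 999
INSERT INTO Books (Title, AuthorId) VALUES ('Orphaned Book', 999);

-- fails: at least one book still references this author
DELETE FROM Authors WHERE AuthorId = 1;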
Whilst they may be a pain when manipulating development/test data, they have saved me a lot of hassle in production.
Think of them as a way to maintain data integrity, especially as a safeguard against orphaned records.
For example, if you had a database relating many PhoneNumber records to a Person, what happens to PhoneNumber records when the Person record is deleted for whatever reason?
They will still exist in the database, but the ID of the Person they relate to will no longer exist in the relevant Person table and you have orphaned records.
Yes, you could write a trigger to delete the PhoneNumber whenever a Person gets removed, but this could get messy if you accidentally delete a Person and need to roll back.
Yes, you may remember to get rid of the PhoneNumber records manually, but what about other developers or methods you write 9 months down the line?
By creating a foreign key that ensures any PhoneNumber is related to an existing Person, you both guard against destroying this relationship and add 'clues' as to the intended data structure.
The main benefits are data integrity and cascading deletes. You can also get a performance gain when they're defined and those fields are properly indexed. For example, you wouldn't be able to create a phone number that didn't belong to a contact, or when you delete the contact you can set it to automatically delete all of their phone numbers. Yes, you can make those connections in your UI or middle tier, but you'll still end up with orphans if someone runs an update directly against the server using SQL rather than your UI. The "hassle" part is just forcing you to consider those connections before you make a bulk change. FKs have saved my bacon many times.
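The Person / PhoneNumber example from the previous answer, with the two common delete behaviours spelled out (names are illustrative):
CREATE TABLE Person (
    PersonId int          PRIMARY KEY,
    FullName varchar(200) NOT NULL
);

CREATE TABLE PhoneNumber (
    PhoneNumberId int         PRIMARY KEY,
    PersonId      int         NOT NULL,
    Number        varchar(30) NOT NULL,
    CONSTRAINT FK_PhoneNumber_Person
        FOREIGN KEY (PersonId) REFERENCES Person (PersonId)
        ON DELETE CASCADE      -- delete the numbers along with the person ...
     -- ON DELETE NO ACTION    -- ... or refuse the delete while numbers still exist
);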
I have the same database running on two different machines. The DBs make extensive use of identity columns, and the tables have clashed pretty horribly. I now want to merge these two together before sorting out the underlying issue, which I may do by:
A) Using GUIDs (unwieldy, but works everywhere)
B) Assigning identity ranges - kind of naff, but it means you can still access records in order, knock up basic SQL and select records easily, and it identifies which machine originated the data.
My question is, what's the best way of re-keying (i.e. changing the primary keys) on one of the databases so the data no longer clashes? We're only looking at 6 tables in total, but lots of rows - around 2M in 3 of the tables.
Update - is there any real SQL code out there that does this? I know about IDENTITY_INSERT etc. I've solved this issue in a number of inelegant ways before, and I was looking for the elegant solution, preferably with a nice T-SQL stored procedure to do the donkey work - if that doesn't exist I'll code it up and place it on the wiki.
A simplistic way is to change all keys on one of the databases by a fixed increment, say 10,000,000, and they will line up. In order to do this, you will have to bring the applications down so the database is quiet and drop all FK references affected by this, recreating them when finished. You will also have to reset the seed value on all affected identity columns to an appropriate value.
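Reseeding is a one-liner per table (table name hypothetical):
DBCC CHECKIDENT ('dbo.Orders', RESEED, 20000000);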
Some of the tables will be reference data, which will be more complicated to merge if it is not in sync. You could possibly have issues with conflicting codes meaning the same thing on different instances or the same code having different meanings. This may or may not be an issue with your application but if the instances have been run without having this coordinated between them you might want to check carefully for this.
Also, data like names and addresses are very likely to be out of sync if there wasn't a canonical source for these. You may need to get these out, run a matching query and get the business to tidy up any exceptions.
I would add another column to the table first and populate it with the new primary key.
Then I'd use update statements to update the new foreign key fields in all related tables.
Then you can drop the old Primary key and old foreign key fields.
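A sketch of that sequence in T-SQL for a hypothetical dbo.Orders / dbo.OrderLines pair (identity columns can't be updated in place, which is exactly why the extra column is needed; the constraint name is made up):
-- the child rows will temporarily point at key values that don't exist yet, so drop the FK first
ALTER TABLE dbo.OrderLines DROP CONSTRAINT FK_OrderLines_Orders;

-- 1. add the replacement key and populate it (10,000,000 being the chosen offset)
ALTER TABLE dbo.Orders ADD NewOrderId int NULL;
GO
UPDATE dbo.Orders SET NewOrderId = OrderId + 10000000;

-- 2. repoint the child rows at the new key values
UPDATE ol
SET    ol.OrderId = o.NewOrderId
FROM   dbo.OrderLines AS ol
JOIN   dbo.Orders     AS o  ON o.OrderId = ol.OrderId;

-- 3. drop the old identity column (and its primary key), rename NewOrderId with sp_rename,
--    then re-create the primary key and the foreign key. Note the renamed column is no
--    longer an IDENTITY column, so plan how new keys get generated afterwards.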