How to compare data of tables from different databases in PostgreSQL?

I have DB- A and B. Both the DBs have a table called 'company'.
I want to check whether both the tables are identical with their data or not.

You could export the data from both tables (with the same ORDER BY clause) and compare the resulting files, or you could define a postgres_fdw foreign table that presents one table in the other database, then use EXCEPT to compute differences.
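A minimal sketch of the postgres_fdw approach, run from inside database A. Host, port, credentials, and the column list are placeholders; the foreign table's columns must match the real definition of `company` in database B:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Connection details are placeholders; adjust for your environment.
CREATE SERVER db_b_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'B', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER db_b_server
    OPTIONS (user 'postgres', password 'secret');

-- Expose B's company table inside database A.
CREATE FOREIGN TABLE company_b (
    id   integer,
    name text
    -- remaining columns must match B's table definition
) SERVER db_b_server OPTIONS (schema_name 'public', table_name 'company');

-- Rows present in A but not B, plus rows present in B but not A.
-- An empty result means the tables hold identical data.
(SELECT * FROM company   EXCEPT SELECT * FROM company_b)
UNION ALL
(SELECT * FROM company_b EXCEPT SELECT * FROM company);
```

EXCEPT compares whole rows, so no ORDER BY is needed; note that EXCEPT also removes duplicate rows, so tables differing only in duplicate counts would need `EXCEPT ALL` instead.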

Related

Data Warehouse design(BigQuery), load into dimensional table independent of fact table

I want to design a data warehouse (Data MART) with one fact table and 2 dimensional tables, where the data mart takes some Slowly Changing Dimensions into consideration, with surrogate key. I'm wondering how I can model this so that data insertion to the dimensional tables can be made independent (inserted before fact table row exist) of the fact table. The data will be streamed from PubSub to BigQuery via Dataflow, thus some of the dimensional data might arrive earlier, needing to be inserted into the dimensional table before the fact data.
I don't completely understand your question. Dimensions are always (or rather, almost always) populated before fact tables are, since fact table records refer to dimensions (and not the other way around).
If you're worried about being able to destroy and rebuild your dimension table without having to also rebuild your fact table, then you'll need to use some sort of surrogate key pipeline to maintain your surrogate key to natural key relationships. But again, I'm not sure that this is what you're asking.
BigQuery does not enforce referential integrity, which means it will not check whether a parent row exists in the dimension table when a child row is inserted into the fact table, and you don't need such a check in a data analytics setup. You can keep appending records to the fact table and the dimension tables independently in BigQuery.
Flatten / denormalise the table and keep the dimensions in the fact table. Repeated records are not an issue in BigQuery, and you can make use of clustering and partitioning.
Another option: if your dimensions live in an RDBMS, upload the dimension tables as files to Cloud Storage (or as rows to Cloud SQL) and join them in Dataflow. In that case you can skip multiple sinks and write a flattened schema into a single BigQuery table sink.
Insertion order does not matter in BigQuery; you can relate event records based on the Pub/Sub message publish time, the source event time, etc.
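To illustrate the point that insertion order doesn't matter, here is a sketch of a query-time join (BigQuery Standard SQL). All table and column names are invented for the example; it assumes an SCD-style dimension with `valid_from`/`valid_to` timestamps, so each fact row resolves against the dimension version valid at its event time, regardless of which row arrived first:

```sql
SELECT
  f.order_id,
  f.amount,
  d.customer_name
FROM `project.mart.fact_orders` AS f
LEFT JOIN `project.mart.dim_customer` AS d
  ON  d.customer_nk = f.customer_nk          -- join on the natural key
  AND f.event_ts >= d.valid_from             -- pick the dimension version
  AND f.event_ts <  d.valid_to               -- valid at the event time
```

The LEFT JOIN also tolerates fact rows whose dimension row has not arrived yet: they simply surface with NULL dimension attributes until the dimension catches up.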

Implementing temporal tables for dimensions for tracking changes

I am working on a star schema and I want to track the history of data for some dimensions, specifically for some columns. Is it possible to use temporal tables as an alternative? If yes, how do I store the current record in a temporal table? Also, does it make sense for the source of my dimension to be the history table of my temporal table?
Determining if two rows or expressions are equal can be a difficult and resource intensive process. This can be the case with UPDATE statements where the update was conditional based on all of the columns being equal or not for a specific row.
To address this need in the SQL Server environment, the CHECKSUM function is helpful in your case, as it natively computes a hash over a set of expressions that can be compared between two records (note that CHECKSUM can collide; HASHBYTES is more robust when correctness matters).
So you compare your two sources, which are logically the ODS and the data warehouse. If the CHECKSUM values from the two sources differ, you update (expire) the old record and insert the new, updated one.
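A T-SQL sketch of that expire-and-insert flow. Table and column names (`dim_customer`, `customer_nk`, `is_current`, and so on) are illustrative assumptions, not from the question:

```sql
-- Expire the current dimension row when the ODS version's hash differs.
UPDATE dw
SET    dw.is_current = 0,
       dw.valid_to   = SYSDATETIME()
FROM   dbo.dim_customer AS dw
JOIN   ods.customer     AS src
  ON   src.customer_nk = dw.customer_nk
WHERE  dw.is_current = 1
  AND  CHECKSUM(src.name, src.city) <> CHECKSUM(dw.name, dw.city);

-- Insert a fresh current row for new keys and for the rows just expired.
INSERT INTO dbo.dim_customer (customer_nk, name, city, valid_from, is_current)
SELECT src.customer_nk, src.name, src.city, SYSDATETIME(), 1
FROM   ods.customer AS src
LEFT JOIN dbo.dim_customer AS dw
  ON   dw.customer_nk = src.customer_nk
 AND   dw.is_current = 1
WHERE  dw.customer_nk IS NULL;
```

Because the UPDATE runs first, changed rows no longer have a current record, so the second statement's anti-join picks them up along with genuinely new keys.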

DB2: Update query with several tables involved

I have two tables, A and B, related by a field. What I want to do is update a field on a subset of rows from table B, where the subset is filtered by data in table A. The value to be set is also taken from table B, but from a different subset of rows, also filtered by data in table A. Can this be done in DB2 with a single query?
Thank you
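The question gives no schema, so every table, column, and predicate below is invented purely to show the shape such a correlated UPDATE can take in DB2; the scalar subquery computes the new value from the "source" subset, while the outer WHERE restricts the update to the "target" subset:

```sql
UPDATE b AS tgt
SET    tgt.val = (
         SELECT src.val
         FROM   b AS src
         JOIN   a AS src_a
           ON   src_a.id = src.a_id
         WHERE  src_a.kind = 'SOURCE'                     -- filter from table A
           AND  src_a.group_id = (SELECT group_id         -- correlate back to
                                  FROM   a                -- the row being updated
                                  WHERE  a.id = tgt.a_id)
         FETCH FIRST 1 ROW ONLY                           -- scalar subquery must
       )                                                  -- return at most one row
WHERE  tgt.a_id IN (SELECT id FROM a WHERE kind = 'TARGET');
```

So yes, it can be done in one statement, provided the inner SELECT is guaranteed to return a single value per updated row.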

Identifying where joins can be made between two sql server tables

Very new to SQL Server. I have a DB with about 20 tables, each with around 40 columns. How can I select two tables and see if they have any columns in common?
I basically want to see where I can make joins. If there's a better way of quickly telling where I can combine info from two tables, that would be helpful too.
First of all, in relational databases there is no such concept as "joinable tables and/or columns". You can always combine two relations (= tables) by crossing every row in one with every row of the other (their Cartesian product) and then filtering the result based on some predicate (also called a "join", if the predicate involves columns of both relations).
The idea of "joinable" tables/columns comes into being only when thinking about the database schema. The schema's author can ask the database engine to enforce some referential integrity, by means of foreign keys.
Now if your database schema is well done (that is, its author was kind/clever enough to put referential integrity all over the schema) you can have a clue of which tables are joinable (by which columns).
To find those foreign keys, for each table you can run sp_help 'databasename.tablename' (you can omit the databasename. part, if it is the current database).
This command will output some facts about the given table, such as its columns (along with their data types, nullability, ...), its indexes and so on. Somewhere near the end it will list the table's foreign keys, along with whether (and where) its primary key is imported as a foreign key in other tables.
For each key imported as foreign key on other table you have a candidate predicate for a join.
Please note that this procedure will only work if the foreign keys are set up correctly. If they aren't, you can fix your database schema (but to do that you must already know which tables are joinable anyway). It also won't show you joinable tables in other databases (on the same or a linked server).
This also won't work for views.
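As a one-shot alternative to running sp_help table by table, the SQL Server catalog views can list every foreign-key relationship in the current database at once; each output row is a candidate join predicate:

```sql
SELECT
    fk.name                               AS constraint_name,
    OBJECT_NAME(fkc.parent_object_id)     AS child_table,
    cp.name                               AS child_column,
    OBJECT_NAME(fkc.referenced_object_id) AS parent_table,
    cr.name                               AS parent_column
FROM sys.foreign_keys        AS fk
JOIN sys.foreign_key_columns AS fkc
  ON fkc.constraint_object_id = fk.object_id
JOIN sys.columns AS cp   -- column on the referencing (child) side
  ON cp.object_id = fkc.parent_object_id
 AND cp.column_id = fkc.parent_column_id
JOIN sys.columns AS cr   -- column on the referenced (parent) side
  ON cr.object_id = fkc.referenced_object_id
 AND cr.column_id = fkc.referenced_column_id
ORDER BY child_table, constraint_name;
```

The same caveat applies: this only surfaces joins that were declared as foreign keys in the schema.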
Also try SQL Server Management Studio: in a database diagram you can see the relationships between tables.

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this. This would be to manually create each table, with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time consuming and tedious because of the manual SSIS column mapping required, and manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make any sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists for the first round of analysis, which will be one of many rounds.
You basically want to scrub the data and objects, correct? Here is what I would do.
1. Restore a backup of the DB.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.).
3. Create a mapping table with two columns and populate it: each row holds an original table name and its new table name.
4. Write a script that iterates through the mapping table, row by row, and renames your tables. Better yet, put the data into Excel, create a third column that builds the T-SQL you want, then cut/paste and execute it in SSMS.
5. Repeat step 4, but for all columns. Best to query sys.columns to get all the objects you need, put them in Excel, and build your T-SQL.
6. Repeat again for any other objects needed.
Backup/restore will be quicker than dabbling in SSIS and data transfer.
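The rename scripts described above can also be generated directly from the catalog views instead of hand-built in Excel. This is a sketch: review the generated statements before executing them, and run the column renames before the table renames (otherwise the column script, which embeds the old table names, must be regenerated):

```sql
-- Generate column renames: Column1, Column2, ... per table.
SELECT 'EXEC sp_rename '
     + QUOTENAME(s.name + '.' + t.name + '.' + c.name, '''')
     + ', ''Column'
     + CAST(ROW_NUMBER() OVER (PARTITION BY t.object_id
                               ORDER BY c.column_id) AS varchar(10))
     + ''', ''COLUMN'';'
FROM sys.columns AS c
JOIN sys.tables  AS t ON t.object_id  = c.object_id
JOIN sys.schemas AS s ON s.schema_id  = t.schema_id;

-- Generate table renames: Table1, Table2, ... (arbitrary but stable order).
SELECT 'EXEC sp_rename '
     + QUOTENAME(s.name + '.' + t.name, '''')
     + ', ''Table'
     + CAST(ROW_NUMBER() OVER (ORDER BY t.object_id) AS varchar(10))
     + ''';'
FROM sys.tables  AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id;
```

Keep the generated scripts: together with the mapping table from step 3, they are the only record of which sanitized name corresponds to which real object.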
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without an FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement instead. Consider an FK column customerID versus a materialID: patterns have widely different meanings and analyses. I would correlate a quality measure with a materialID or shiftID, but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well, yes: that customer is a college team and they use aluminum bats.
On top of that, you strip the indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup/restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget the triggers and constraints, since their names may disclose trade secrets. Then rename the columns, and then the tables.