SSIS - Looking Up Records from Different Databases - sql-server

I have a source table in a Sybase database (ORDERS) and a source table in an MSSQL database (DOCUMENTS). I need to query the Sybase database and, for each row found in the ORDERS table, get the matching row(s) by order number from the DOCUMENTS table.
I originally wrote the SSIS package using a Lookup transformation, which was simple, except that the relationship can be one-to-many: an order number exists once in the ORDERS table, but more than one document can exist for it in the DOCUMENTS table, and the SSIS Lookup only returns the first match.
My second attempt will be to stage the rows from the ORDERS table into a staging table in MSSQL, then loop through them with a FOR EACH LOOP container, fetching the matching rows from the DOCUMENTS table and inserting them into another staging table. After all the ORDERS rows have been processed, I will write a query that joins the two staging tables to give me my result. My concern with this method is that it opens and closes the DOCUMENTS database connection many times, which will not be very efficient (although there will probably be fewer than 200 records).
Or could you let me know of any other way of doing this?
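For reference, the final query I have in mind to join the two staging tables would look something like this (table and column names are illustrative):

SELECT o.OrderNumber,
       o.OrderDate,
       d.DocumentId,
       d.DocumentName
FROM stg.Orders AS o
JOIN stg.Documents AS d
    ON d.OrderNumber = o.OrderNumber;  -- one order row fans out to all of its documents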

Related

Synchronize table between two different databases

Once a day I have to synchronize table between two databases.
Source: Microsoft SQL Server
Destination: PostgreSQL
Table contains up to 30 million rows.
The first time I will copy the whole table, but after that, for efficiency, my plan is to insert/update only the changed rows.
That way, if I delete a row from the source database, it will not be deleted from the destination database.
The problem is that I don’t know which rows were deleted from the source database.
My rough idea right now is a binary search: compare a sum/checksum over row ranges on each side and narrow in on the deleted rows.
I’m at a dead end - please share your thoughts on this...
In SQL Server you can enable Change Tracking to find out which rows have been inserted, updated, or deleted since the last time you synchronized the tables.
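A minimal sketch of what that looks like, assuming a source table named dbo.BigTable with an Id primary key and a @last_sync_version saved from the previous run:

-- enable tracking once at the database and table level
ALTER DATABASE SourceDb
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.BigTable ENABLE CHANGE_TRACKING;

-- on each sync, pull only what changed since the last synchronized version;
-- SYS_CHANGE_OPERATION is 'I', 'U' or 'D', so deletes are reported explicitly
DECLARE @last_sync_version bigint = 0;  -- persist this value between runs

SELECT ct.Id, ct.SYS_CHANGE_OPERATION, t.*
FROM CHANGETABLE(CHANGES dbo.BigTable, @last_sync_version) AS ct
LEFT JOIN dbo.BigTable AS t ON t.Id = ct.Id;  -- deleted rows have no matching row

SELECT CHANGE_TRACKING_CURRENT_VERSION();  -- store as the next run's @last_sync_version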
With tds_fdw (a Foreign Data Wrapper for SQL Server), map the source table to a foreign table in PostgreSQL and use a join to find/exclude the rows that you need.
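A rough sketch of that approach on the PostgreSQL side, assuming tds_fdw is installed; the connection details, table, and column names below are placeholders:

CREATE EXTENSION IF NOT EXISTS tds_fdw;

CREATE SERVER mssql_src FOREIGN DATA WRAPPER tds_fdw
    OPTIONS (servername 'mssql-host', port '1433', database 'SourceDb');

CREATE USER MAPPING FOR CURRENT_USER SERVER mssql_src
    OPTIONS (username 'sync_user', password 'secret');

-- foreign table mirroring the SQL Server source table
CREATE FOREIGN TABLE src_big_table (id bigint, payload text)
    SERVER mssql_src OPTIONS (schema_name 'dbo', table_name 'big_table');

-- rows that exist locally but are gone from the source were deleted upstream
DELETE FROM big_table AS d
WHERE NOT EXISTS (SELECT 1 FROM src_big_table AS s WHERE s.id = d.id);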

SQL Server (via Pentaho): Delete takes way too long, just two rows per second

I am using Pentaho's Delete step to delete about 200k rows (35 columns each) from a SQL Server Express table. The primary key is the condition I sort the rows on, and the commit size is 10k.
Server performance should not be the issue, because I am able to insert at over 1,000 rows per second.
I tried the same steps on a table without a primary key constraint; same issue.
Would appreciate any help!
What should the statement look like if I don't want to type all 400k PK numbers into the WHERE clause?
Not sure what strategy Pentaho uses to run the deletes, but you might try loading the 400k IDs into a staging or temporary table and referencing that in the DELETE, e.g.
delete from maintable where id in (select id from maintable_ids_to_delete)
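A slightly fuller sketch of that idea, with the staging table keyed so the join stays fast (all names are illustrative):

-- load the 400k keys once (bulk load / Pentaho table output), then delete in one set-based statement
CREATE TABLE maintable_ids_to_delete (id INT NOT NULL PRIMARY KEY);

DELETE m
FROM maintable AS m
JOIN maintable_ids_to_delete AS d
    ON d.id = m.id;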

What are the best Indexes for a frequently changing table?

I work on databases used for Analysis workloads so I usually use a stored procedure to output final datasets into SQL Server tables that we can connect to from Tableau or SAS, etc.
We process multiple batches of data through the same system so the output dataset tables all contain a BATCH_ID column which users use to filter on the specific batch they want to analyze.
Each time a dataset is published, I delete any old data for that batch in the output table before inserting a fresh set of rows for that batch of data. For this type of workload what do you think the best indexes would be?
I'm currently using a clustered index on the BATCH_ID column because I figured it would group all of a batch's rows together, making filtering, deletion, and insertion efficient. Will this cause a lot of index or table fragmentation over time? Keep in mind that the entire batch is deleted and re-inserted each time, so there's no issue with partial updates or additions to existing batches.
Would I be better off with a typical clustered index on an identity column and a non-clustered index on BATCH_ID?
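For reference, the two layouts I'm weighing look roughly like this (everything except BATCH_ID is a stand-in name):

-- Option 1 (current): cluster the output table on the batch key itself
CREATE CLUSTERED INDEX CIX_OutputDataset_BatchId ON dbo.OutputDataset (BATCH_ID);

-- Option 2: cluster on an identity column and index BATCH_ID separately
CREATE TABLE dbo.OutputDataset2 (
    OutputId     BIGINT IDENTITY(1,1) NOT NULL,
    BATCH_ID     INT NOT NULL,
    MeasureValue DECIMAL(18,2) NULL,  -- stand-in for the ~50 dataset columns
    CONSTRAINT PK_OutputDataset2 PRIMARY KEY CLUSTERED (OutputId)
);
CREATE NONCLUSTERED INDEX IX_OutputDataset2_BatchId
    ON dbo.OutputDataset2 (BATCH_ID);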

SQL Normalizing array of tables into multiple new tables

I have a database with 51 tables all with the same schema (one table per state). Each table has a couple million rows and about 50 columns.
I've normalized the columns into 6 other tables, and now I want to import all of the data from those 51 tables into the 6 new tables. The column names are all the same, and so I'm hoping I can automate the process of importing all the data.
I'm assuming what I'll need to do is:
Select the names of all the tables that are in the raw schema
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'raw'
Iterate over all the results
Grab all the rows from that table and insert the appropriate columns into the appropriate new tables
Delete the rows from the raw table
Is there anything I'm missing? Also, is there any way to have this run on the SQL Server so I don't have to have my SQL Server Management Studio open the whole time?
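In T-SQL I'm picturing something like the cursor-driven dynamic SQL below (the target table and column names are placeholders, and presumably a SQL Server Agent job could run it server-side so SSMS doesn't need to stay open):

DECLARE @table sysname, @sql nvarchar(max);

DECLARE raw_tables CURSOR LOCAL FAST_FORWARD FOR
    SELECT TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'raw';

OPEN raw_tables;
FETCH NEXT FROM raw_tables INTO @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- move one normalized slice; repeat a similar statement for each of the 6 target tables
    SET @sql = N'INSERT INTO norm.Addresses (OrderId, Street, City, StateCode)
                 SELECT OrderId, Street, City, StateCode FROM raw.' + QUOTENAME(@table) + N';';
    EXEC sys.sp_executesql @sql;

    SET @sql = N'DELETE FROM raw.' + QUOTENAME(@table) + N';';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM raw_tables INTO @table;
END

CLOSE raw_tables;
DEALLOCATE raw_tables;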
Yes, you can certainly automate this with T-SQL, but I would recommend using SSIS in this case. As you say, the structure of all the tables is the same, so you can build one ETL process and then just change the table name in the source. Consequently, you will have the following advantages:
Solve the issue with a couple of clicks
Low risk of errors
You will be able to use any number of data transformations

Finding relationship between columns of 2 SQL Server tables

In my current environment there are hundreds of DB tables, roughly 20% of which are master tables. The values of some columns in the other tables are a subset of a column from a master table. Unfortunately this is not documented.
Assuming the master tables and the dependent tables are known, how can I craft a SQL statement to find out which column of a dependent table is a subset of which master table column?
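For a single known pair of columns, I imagine the check would be something like this (table and column names are placeholders):

-- returns 0 when every non-NULL value in the dependent column also exists in the master column
SELECT COUNT(*) AS orphan_values
FROM dbo.DependentTable AS d
WHERE d.CandidateCol IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM dbo.MasterTable AS m
                  WHERE m.MasterCol = d.CandidateCol);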
Thanks,
cabear