Data migration in Informatica - database

A large amount of data is coming from source to target. After a successful insertion into the target, we have to change the status of every row to "committed". But how will we know whether all the data has arrived in the target, without directly querying the source?
For example, suppose 10 records have been migrated from source to target.
We cannot change the status of all the records to "committed" before all the records have been successfully inserted into the target.
So before changing the status of all the records, how will we know whether an 11th record is still on its way?
Is there anything that will tell me the total number of records in the source?
I need a real-time answer.

We had the same scenario, and this is what we did:
First of all, to check whether the data has been loaded into the target, you can join the source and target tables. The update will lock the rows, so a commit must be fired at the database level on the target table (so that the lock for the update can be taken).
After joining, update the loaded data in the source based on the join with the target column.
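For illustration, a minimal Oracle-style SQL sketch of that join-based status update; the source_tbl and target_tbl names, the id key, and the status column are assumptions for this example:
-- Mark only those source rows that already have a matching row in the target.
UPDATE source_tbl s
SET    s.status = 'committed'
WHERE  s.status <> 'committed'
  AND  EXISTS (SELECT 1
               FROM   target_tbl t
               WHERE  t.id = s.id);
COMMIT;
Rows that arrive in the source after the session was stopped keep their old status and are picked up on the next run.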
A few things:
You have to stop your session first (use pmcmd in a command task to stop the session).
Update the data in your source table and restart the session.
Keep the update batches to around 20k-30k rows so the update goes smoothly.

Related

Lock and update records from a DataSet in the DB

From learn.microsoft.com "Populating a DataSet from a DataAdapter"
Pulling all of the table to the client also locks all of the rows on the server.
I didn't find any information (in the System.Data namespace) about the possibility of putting a lock on records (or groups of records) in the DB that were read into a DataSet (DataTable), i.e. a lock that would affect all users of the DB, not only those who work with the database through my application.
Also From learn.microsoft.com "Using UpdatedRowSource to Map Values to a DataSet"
The Update method resolves your changes back to the data source; however other clients may have modified data at the data source since the last time you filled the DataSet. To refresh your DataSet with current data, use the DataAdapter and Fill method. New rows will be added to the table, and updated information will be incorporated into existing rows. The Fill method determines whether a new row will be added or an existing row will be updated by examining the primary key values of the rows in the DataSet and the rows returned by the SelectCommand. If the Fill method encounters a primary key value for a row in the DataSet that matches a primary key value from a row in the results returned by the SelectCommand, it updates the existing row with the information from the row returned by the SelectCommand and sets the RowState of the existing row to Unchanged. If a row returned by the SelectCommand has a primary key value that does not match any of the primary key values of the rows in the DataSet, the Fill method adds a new row with a RowState of Unchanged.
If we have modified copies of records from the DB (locally, in memory) in a DataSet and want to propagate these changes to the server, why must we refresh our local records first? That refresh can throw away all our changes.
In general, I don't understand the strategy for organizing modification of records in the DB through a DataSet:
make a local copy of the records (via Adapter.Fill(DataSet));
change the record (or records) locally (over some period of time) and wait until the user clicks "Update", at which point:
save all modifications to a temp table?
reread the records from the DB (again via Adapter.Fill(DataSet))?
compare the records from the temp table with the refreshed DataSet?
and, if nothing has changed, quickly update the records in the DB (via Adapter.Update(DataSet))?
But even in that case, it's possible that someone is quicker than I am and updates "my records" between my reread and my update.
I reread all the ADO.NET articles (on learn.microsoft.com) and found some additional information, so I can answer my own questions:
1) Reading records from the DB into a DataSet (via Adapter.Fill(DataSet)) does not put any lock on the records in the DB (the phrase quoted above from learn.microsoft.com, "Pulling all of the table to …", is not correct).
2) “In a multiuser environment, there are two models for updating data in a database: optimistic concurrency and pessimistic concurrency. The DataSet object is designed to encourage the use of optimistic concurrency (ONLY) for long-running activities, such as remoting data and interacting with data.”
A transaction (a DbTransaction-derived type), which consists of a single command or a group of commands executed as a package, can put different types of locks on rows in DB tables (see the DbTransaction.IsolationLevel property).
Incorrect use of transactions can badly affect how the DB performs in a multi-user environment (transactions must be kept as short as possible).
3) When we talk about locking records in the DB, we must clearly understand the aim of the locking and how it corresponds with the logic of our application.
For example, say we want to build a system for selling cinema tickets (to simplify, for one movie and one session only).
The application (which will run on multiple devices) must connect to the DB, fill a local DataSet, and show users all available seats (all rows whose SeatStatus field equals "FREE").
After selecting a seat, the user must press "Make Reservation" (MakeReservation()). MakeReservation() must use "Testing for Optimistic Concurrency Violations" (see the example from Microsoft; it is more elegant than the one in my first post) to try to change the SeatStatus field to "RESERVED". At this point optimistic concurrency is used: whoever presses first gets the seat.
Whoever is second receives the message "Sorry, the seat is reserved" and an update of the current list of available seats (UpdateSeats()).
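As an illustration of the optimistic check assumed inside MakeReservation(), a rough T-SQL sketch (the Seats table, SeatId key, and @SeatId parameter are hypothetical):
-- The UPDATE only succeeds if the seat is still FREE at the moment of the write.
UPDATE Seats
SET    SeatStatus = 'RESERVED'
WHERE  SeatId = @SeatId
  AND  SeatStatus = 'FREE';
-- If no row was affected (@@ROWCOUNT = 0), someone else won the race:
-- show "Sorry, the seat is reserved" and call UpdateSeats() to refresh the list.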
Also, UpdateSeats() must run periodically on all active devices (once per second).
The user who pressed the button first must, on the next screen, enter credit card information and press "Pay" (PayTicket()).
PayTicket() must connect to the bank, check the payment, and change SeatStatus to "OCCUPIED"; in this case (IMHO) it is more correct to use a transaction with pessimistic concurrency.
If the user does not press "Pay" within the required time (5 min.), the application returns to the previous screen and the SeatStatus field is changed back to "FREE".
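For the PayTicket() step, a pessimistic variant could look roughly like this in T-SQL (names are again hypothetical, and per point 2 above the transaction should stay short, so the bank call is best made before this block):
BEGIN TRANSACTION;
-- Take a pessimistic lock on the seat row so no one else can touch it.
SELECT SeatStatus
FROM   Seats WITH (UPDLOCK, HOLDLOCK)
WHERE  SeatId = @SeatId;
-- Confirm the reservation and mark the seat as sold.
UPDATE Seats
SET    SeatStatus = 'OCCUPIED'
WHERE  SeatId = @SeatId
  AND  SeatStatus = 'RESERVED';
COMMIT TRANSACTION;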
P.S. Anyone who knows a more correct way to implement this task is welcome to share it.

Find out the recently selected rows from an Oracle table, and can I update a LAST_ACCESSED column whenever the table is accessed?

I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these records (rows) were selected or retrieved in the last 5 years. The select query can happen from multiple places: sometimes the row is returned as a single row, sometimes as part of a set of rows. There is a select query that does the fetching over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up the table. I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table, but that is a costly operation in terms of the time taken for the whole process. At least 1,000 - 10,000 records will be selected from the table in a single operation. Is there a more efficient way to do this than updating the table after reading it? Mine is a multithreaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization, which brings you Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become very costly). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Partitioning is also an option that needs to be licensed.
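If Heat Map is an option for you, a rough sketch of enabling it and inspecting segment-level access times (check the view's columns against your 12c documentation; the schema and table names are placeholders):
-- Enable Heat Map tracking at the instance level (licensed feature, see above).
ALTER SYSTEM SET HEAT_MAP = ON;
-- Segment-level read/write activity per table or partition.
SELECT *
FROM   dba_heat_map_segment
WHERE  owner = 'MY_SCHEMA'
AND    object_name = 'MY_TABLE';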
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
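If the table is partitioned, the purge itself can then be a single partition drop rather than millions of row deletes; a sketch with placeholder names:
-- Purge a whole partition that Heat Map shows as no longer read.
-- (In-Database Archiving could instead mark these rows as archived.)
ALTER TABLE my_table DROP PARTITION p_old UPDATE GLOBAL INDEXES;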
I couldn't use any of the built-in solutions. I tried the following:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger similarly hurt performance.
Finally, I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used (selected in a query) are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
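The cross-checked cleanup can then be a single statement along these lines (the created_date column, the still_used_entries table, and the 5-year window expressed as 60 months are all illustrative assumptions):
-- Delete old rows except those recorded as still being selected.
DELETE FROM my_table t
WHERE  t.created_date < ADD_MONTHS(SYSDATE, -60)
  AND  NOT EXISTS (SELECT 1
                   FROM   still_used_entries u
                   WHERE  u.guid = t.guid);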

How to find out the rows affected in SQL Profiler or trace?

I'm using tracing to log all delete or update queries run through the system. The problem is, if I run a query like DELETE FROM [dbo].[Artist] WHERE ArtistId>280, I know how many rows were deleted but I'm unable to find out which rows were deleted (the data they had).
I'm thinking of doing this as a logging system so it would be useful to see which rows were affected and what data they had if at all possible. I don't really want to use triggers for this job but I will if I have to (and if it's feasible).
If you need the original data and are planning on storing all the deleted data in a separate table, why not just logically delete the original data rather than physically delete it? i.e.
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId>280
Then you only need to add one column to your current table rather than creating new tables and scripts to support them. You could then partition the current table based on the deleted flag if you are worried about disk space/performance, etc.
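A rough T-SQL sketch of that approach (the flag column, its default, and the view name are just examples):
-- Add the soft-delete flag; existing rows default to 'not deleted'.
ALTER TABLE dbo.Artist ADD Artist_deleted BIT NOT NULL DEFAULT 0;
GO
-- "Delete" rows by flagging them, so the log still has their data.
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId > 280;
GO
-- Existing read paths can go through a view that hides logically deleted rows.
CREATE VIEW dbo.ActiveArtist AS
SELECT * FROM dbo.Artist WHERE Artist_deleted = 0;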

How to control which rows were sent via SSIS

I'm trying to create an SSIS package that will periodically send data to another database. I want to send only new records (I need to keep track of which records were sent), so I created a status column in my source table.
I want my package to update this column after successfully sending the data, but I can't just update all rows with "unsent" status, because some rows may have been added during package execution. I also can't use transactions (I mean the isolation levels that would solve my problem: I can't use Serializable because I mustn't prevent users from adding new rows, and the Sequence Container doesn't support Snapshot).
My next idea was to use a Recordset and, after sending the data to the other DB, use it to get the IDs of the sent rows, but I couldn't find a way to use it as a data source.
I don't think I should set the status to "to send" and then update it to "sent"; I believe that would be too costly.
Now I'm thinking about using a temporary table, but I'm not convinced this is the right way to do it. Am I missing something?
Recordset is a destination only; you cannot use it as a source in a Data Flow task.
But since the data is saved to a variable, it is available in the Control Flow.
After the Data Flow completes, go back to the Control Flow and create a Foreach Loop container that iterates over the Recordset variable.
Read each Recordset value into a variable and use it to run an update query.
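That per-row update can be a simple parameterized statement in an Execute SQL Task (the table, column, and status value are assumptions for this sketch):
-- The ? placeholder is mapped to the loop variable holding the sent row's ID.
UPDATE dbo.SourceTable
SET    Status = 'sent'
WHERE  Id = ?;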
Also, see whether the Lookup transform can be useful to you; it can route rows that match and rows that don't match.
I will improve the answer based on the discussion.
What you have here is a very typical data-mirroring problem. To start with, I would not simply have a boolean that signifies that a record was "sent" to the destination (mirror) database. At the very least, I would put a LastUpdated datetime column in the source table and have triggers on that table, on insert and update, that put the system date into that column. Then, every day I would execute an SSIS package that reads the records updated in the last week and checks whether those records exist in the destination, splitting the data stream into records that already exist and records that do not exist in the destination. For those that do exist, if the LastUpdated in the destination is earlier than the LastUpdated in the source, update them with the values from the source. For those that do not exist in the destination, insert the record from the source.
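A rough T-SQL sketch of the LastUpdated column and trigger described above (the table and key names are placeholders):
ALTER TABLE dbo.SourceTable ADD LastUpdated DATETIME NOT NULL DEFAULT GETDATE();
GO
-- Keep LastUpdated current on every insert or update.
CREATE TRIGGER trg_SourceTable_LastUpdated
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE s
    SET    LastUpdated = GETDATE()
    FROM   dbo.SourceTable s
    INNER JOIN inserted i ON i.Id = s.Id;
END;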
It gets a little more interesting if you also have to deal with record deletions.
I know it may seem wasteful to read and check a week's worth every day, but your database should hardly feel it; it provides a lot of good double-checking and saves you a lot of headaches by giving you a simple, error-tolerant algorithm. If some record does not get transferred because of a hiccup on the network, no worries: it gets picked up the next day.
I would still set up the SSIS package as a server task that sends me an email with any errors, so that I can keep track. Most days you get no errors, and when there are errors, you can wait a day or resolve the cause and let the next day's run pick up the problems.
I am doing a similar thing; in my case, I have a status on the source record.
I read in all records with a status of new.
Then use an OLE DB Command to execute SQL on each row, changing the status to "In Progress" (in your WHERE clause, enter a ? as the value in the Component Properties tab, and you can configure it as a parameter from the table row, such as an ID or other primary key, in the Column Mappings tab).
Once the records are processed, you can change all "In Progress" records to "Success" or something similar using another OLE DB Command.
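The two statements then look something like this (table, column, and status values are illustrative):
-- Per row, before processing: claim the record via the ? parameter mapped to its key.
UPDATE dbo.SourceTable SET Status = 'In Progress' WHERE Id = ?;
-- After processing: flip the claimed records to their final status (again per row via ?,
-- or as one set-based update after the data flow).
UPDATE dbo.SourceTable SET Status = 'Success' WHERE Status = 'In Progress';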
Depending on what you are doing, you can use the status to mark records that errored at some point, and require further attention.

SSIS and CDC - Incorrect state at end of "Mark Processed Range"

The Problem
I currently have CDC running on a table named subscription_events. The corresponding CT table is being populated with new inserts, updates, and deletes.
I have two SSIS flows that move data from subscription_events into another table in a different database. The first flow is the initial flow and has the following layout:
The Import Rows Into Vertica step simply has a source and a destination and copies every row into another table. As a note, the source table is currently active and has new rows flowing into it every few minutes. The Mark Initial Load Start/End steps store the current state in a variable and that is stored in a separate table meant for storing CDC names and states.
The second flow is the incremental flow and has the following layout:
The Import Rows Into Vertica step uses a CDC source and should pull the latest inserts, updates, and deletes from the CT table and these should be applied to the destination. Here is where the problem resides; I never receive anything from the CDC source, even though there are new rows being inserted into the subscription_events table and the corresponding CT table is growing in size with new change data.
To my understanding, this is how things should work:
Mark Initial Load Start
CDC State should be ILSTART
Data Flow
Mark Initial Load End
CDC State should be ILEND
Get Processing Range (First Run)
CDC State should be ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State should be TFEND
Get Processing Range (Subsequent Runs)
CDC State should be TFSTART
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State should be TFEND
Repeat the last three steps
This is not how my CDC states are being set, though... Here are my states along the same process.
Mark Initial Load Start
CDC State is ILSTART
Data Flow
Mark Initial Load End
CDC State is ILEND
Get Processing Range (First Run)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State is ILEND
Get Processing Range (Subsequent Runs)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State is ILEND
Repeat the last three steps
I am never able to get out of the ILUPDATE/ILEND loop, so I am never able to get any new data from the CT table. Why is this happening and what can I do to fix this?
Thank you so much, in advance, for your help! :)
Edit 1
Here are a couple of articles that sort of describe my situation, though not exactly. They did not help me resolve this issue, but they might help you think of something I can try.
http://www.bradleyschacht.com/understanding-the-cdc-state-value/
http://msdn.microsoft.com/en-us/library/hh231087.aspx
The second article includes this image, which shows the ILUPDATE/ILEND loop I am trapped in.
Edit 2
Last week (May 26, 2014) I disabled then re-enabled CDC on the subscription_events table. This didn't change anything, so I then disabled CDC on the entire database, re-enabled CDC on the database, and then enabled CDC on the subscription_events table. This did make CDC work for a few days (and I thought the problem had been resolved by going through this process). However, at the end of last week (May 30, 2014) I needed to re-load the entire table via this process, and I ran into the same problem again. I'm still stuck in this loop and I'm not sure why or how to get out of it.
Edit 3
Before I was having this problem, I was having a separate issue which I posted about here:
CDC is enabled, but cdc.dbo<table-name>_CT table is not being populated
I am unsure if these are related, but figured it couldn't hurt to provide it.
I had the same problem.
I have an initial load package for kicking things off and a separate incremental load package for loading updates on a schedule.
You fix it by putting a "Mark CDC start" CDC Control Task at the end of your Initial Load package only. That will leave the state value in a TFEND state, which is what you want when your incremental load starts.
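To verify, you can look at the stored state string directly. With the default state table that the CDC Control Task creates, the check looks roughly like this (your table name and state name may differ depending on how the tasks were configured):
-- Before the incremental package runs, the state string should start with TFEND
-- rather than ILUPDATE/ILEND, so that Get Processing Range moves on to TFSTART.
SELECT name, state
FROM   dbo.cdc_states
WHERE  name = 'CDC_State';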
