Pentaho ETL Table Input Iteration - loops

Context
I have a table with customer information. I want to find the repeat customers in the table based on columns like:
First_Name
Last_Name
DOB
Doc_Num
FF_Num
etc.
Now, to compare one customer with the rest of the records in the same table, I need to:
read one record at a time
compare that record with the rest, so that if one column does not match,
the other columns are compared for those records
Question
Is there a way to make the Table_Input step read or output one record at a time, moving on to the next record automatically only after processing of the previous record is complete? This should continue until all the records in the table have been checked/processed.
Also, I would like to know whether we can iterate the same procedure instead of reading one record at a time from Table_Input.

Making your Table Input read and write row by row doesn't seem like the best solution, and I don't think it would achieve what you want (e.g. keeping track of previous records).
You could try the Unique rows step, which can redirect a duplicate row (using the key you choose) to another flow where it will be treated differently (or simply drop it if you don't want it). From what I can see, you'll want multiple Unique rows steps to check each of the columns.

Is there a way to make the Table_Input step read or output one record at a time but it should read the next record automatically after the processing of the previous record is complete?
Yes, it is possible to change the row buffer between the steps: set "Nr of rows in rowset" to 1. However, changing this property is not recommended unless you are running low on memory, as it can make the tool behave abnormally.
Now, as per the comments shared, I see there are two questions:
1. You need to check the count of duplicate entries:
You can achieve this either with a Group By step or with the Unique rows step, as astro11 answered. You can easily get a count per name, and if the count is greater than 1 you can treat it as a duplicate (see the SQL sketch at the end of this answer).
2. Checking on the two data rows:
You want to validate two names, e.g. "John S" and "John Smith". Both names should ideally be considered the same person, hence a duplicate.
First of all, this is a data quality issue and no tool will consider these rows the same out of the box. What you can do is use the "Fuzzy match" step. Based on the algorithm you choose, it will give you a measure of the closest match for the names. To achieve this, however, you need a separate MASTER table with all the possible names. The "Jaro Winkler" algorithm works well for finding the closest match.
Hope this helps :)
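If you would rather do the duplicate count in the Table Input query itself instead of a Group By step, a minimal SQL sketch could look like this (the table name customers is an assumption; the columns come from the question). Note this only catches rows that match exactly on every column; the "John S" vs "John Smith" case still needs the Fuzzy match step.
SELECT First_Name, Last_Name, DOB, Doc_Num, FF_Num, COUNT(*) AS cnt
FROM customers            -- assumed table name
GROUP BY First_Name, Last_Name, DOB, Doc_Num, FF_Num
HAVING COUNT(*) > 1       -- only combinations that occur more than once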

Related

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive. (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own application layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the database? Save it to a file on the application server instead.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: id of the record, datetime, userid (maybe IP address, browser version, etc.), and just about anything else I could capture that was even possibly relevant. (For example, six months from now your manager decides s/he wants to know not only which records were used the most, but also which users are using the most records, or what time of day that usage happens, etc.)
This type of information can be useful for things you've never even thought of down the road, and if the table starts to grow large you can always roll it up and prune it to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it would be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to SELECT the data from within a stored procedure that also issues the logging command, so that the client is not making two round trips (one for the select, one for the update/insert).
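As a rough sketch of that stored-procedure idea (the procedure, table, and column names here are made up for illustration, not taken from the question):
CREATE PROCEDURE dbo.GetRecordAndLog
    @RecordId INT,
    @UserId   INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Log the access first (dbo.RecordAccessLog is an assumed logging table)
    INSERT INTO dbo.RecordAccessLog (RecordId, UserId, AccessedAt)
    VALUES (@RecordId, @UserId, GETDATE());

    -- Then return the data the client actually asked for
    SELECT *
    FROM dbo.MyTable
    WHERE Id = @RecordId;
END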
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, see the SQL Server auditing documentation, or search for "database auditing for SELECT statements".
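As a sketch of what that looks like in T-SQL, assuming SQL Server 2008 or later (which may not match the "old version" in the question; database-level audit specifications also required Enterprise Edition before SQL Server 2016 SP1). The audit, file path, and table names are placeholders:
-- Server-level audit that writes to a file (run in master)
CREATE SERVER AUDIT SelectTracking
    TO FILE (FILEPATH = 'C:\AuditLogs\');
ALTER SERVER AUDIT SelectTracking WITH (STATE = ON);

-- Database-level specification that records SELECTs on one table (run in the user database)
CREATE DATABASE AUDIT SPECIFICATION SelectTrackingOnMyTable
    FOR SERVER AUDIT SelectTracking
    ADD (SELECT ON OBJECT::dbo.MyTable BY public)
    WITH (STATE = ON);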
Use another table as a key/value pair with two columns (e.g. id_selected, times) to store the ids of the records you select from your main table, and increment the times value by 1 every time those records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. As a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY UPDATE syntax is off the top of my head and is MySQL-specific. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
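A rough T-SQL equivalent using MERGE might look like this (same hypothetical table and column names as above):
MERGE countTable AS t
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS s
    ON t.id_selected = s.id
WHEN MATCHED THEN
    UPDATE SET times = t.times + 1          -- seen before: bump the counter
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (s.id, 1);   -- first time: start at 1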

In SSIS, how do I get the number of rows returned from the Source that SHOULD be processed

I am working on a project to add logging to our SSIS packages. I am doing my own custom logging by implementing some of the event handlers. I have implemented the OnInformation event to write the time, source name, and message to the log file. When data is moved from one table to another, the OnInformation event will give me a message such as:
component "TABLENAME" (1)" wrote 87 rows.
In the event that one of the rows fails, let's say only 85 rows were processed out of the expected 87; I would assume that the above line would read wrote 85 rows. How do I track how many rows SHOULD HAVE been processed in this case? I would like to see something like wrote 85 of 87 rows. Basically, I think I need to know how to get the number of rows returned from the Source's query. Is there an easy way to do this?
Thank you
You can use the Row Count transformation after the data source and save the count to a variable. This gives you the number of rows to be processed. Once the data is loaded into the destination, use an Execute SQL Task in the Control Flow with Select Count(*) from <<DestinationTable>> and save the count into another variable (use a WHERE clause in the query to identify the current load). You will then have the number of rows processed for logging.
Hope this helps!
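The Execute SQL Task query might look roughly like this; dbo.DestinationTable and LoadBatchId are assumptions, with the ? parameter mapped to a package variable identifying the current load:
SELECT COUNT(*) AS RowsLoaded
FROM dbo.DestinationTable
WHERE LoadBatchId = ?    -- mapped to the variable that identifies the current load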
Not enough space in comments to provide feedback. Posting an incomplete answer as I need to leave for the day.
You are going to have trouble accomplishing what you are asking for. Based on your comments in Gowdhaman008's answer, the value of a variable is not visible outside of a Data Flow until after the finalizer event fires (OnPostExecute, I think). You can cheat and get that data out by using a script task to count rows through and fire off events, custom or predefined, to report package progress. In fact, just capture the OnPipelineRowsSent event. That will record how many rows are passing through a particular juncture and the time surrounding it (see the SSIS Performance Framework). Plus, you don't have to do any custom work or maintenance on your stuff. Out-of-the-box functionality is a definite win.
That said, you aren't really going to know how many rows are coming out of a source until it's finished. That sounds stupid and I completely agree but it's the truth. Imagine a simple case, an OLE DB Source that is going to send 1,000,000 rows straight into an OLE DB Destination. Most likely, not all 1M rows are going to start in the pipeline, maybe only 10k will be in the first buffer. Those buffers are pushed to the destination and now you know 10k rows out of 10k rows have been processed. Lather, rinse, repeat a few times and in this buffer, a row has a NULL where it shouldn't. Boom goes the dynamite and the process fails. We have had 60k rows flow into the pipeline and that's all we know about because of the failure.
The only way to ensure we have accounted for all the source rows is to put an asynchronous transformation into the pipeline to block all downstream components until all the data has arrived. This will obliterate any chance you have of getting good performance out of your packages. You'd still be subject to the aforementioned restrictions on updating variables, but your FireXEvent message would accurately describe how many rows could have been processed in the queue.
If you started an explicit transaction, you could do something ugly like an Execute SQL Task just to get the expected count, write that to a variable and then log rows processed, but then you're double querying your data and you increase the likelihood of blocking on the source system because of the double pump. And that's only going to work for something database-like. The same concept would apply for a flat file, except now you'd need a script task to read all the rows first.
Where this gets uglier is a slow-starting data source, like a web service. The default buffer size might cause the entire package to run much longer than it needs to, simply because we are waiting on the data to arrive (see "Slow starts").
What I'd do
I'd record my starting and error counts (and more) using the Row Count transformation. This will help you account for all the data that came in and where it went. I'd then turn on the OnPipelineRowsSent event to allow me to query the log and see how many rows are flowing through it RIGHT NOW.
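If you log to the SQL Server log provider, those events land in dbo.sysssislog (dbo.sysdtslog90 on SQL Server 2005), so a query along these lines, just a sketch, shows the rows-sent messages as they arrive:
SELECT starttime, source, message
FROM dbo.sysssislog              -- dbo.sysdtslog90 on SQL Server 2005
WHERE event = 'OnPipelineRowsSent'
ORDER BY starttime DESC;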
What you want is the Row Count transformation. Just add that to your data flow after your source query, and assign its output to a variable. Then you can write that variable to your log file.
Here is what I currently do. It's super tedious, but it works.
1) (screenshot of the overall data flow omitted)
2) I have a constant "1" value on all of the records. They are literally all the same.
3) Using a multicast step, I send the data flow off in 2 directions. Despite all being the same, we still have to sort by that constant value.
4) Use an aggregate step to aggregate on that constant and then re-sort it in order to join with the bottom data flow (which holds all of the actual data records with no aggregation).
Doing this allows me to have my initial row count.
Later on, I use a conditional split step and do the same thing again after applying my condition. If the row count is the same, everything is fine and there are no problems.
If the row count is not the same, something is wrong.
This is the general idea for the approach for solving your problem without having to use another data flow step.
TLDR:
Get a row count for one of the conditions by using a multicast, a sort on some constant value, and an aggregation step.
Do a sort and merge to grab the row count.
Use a conditional split and do it again.
If the pre and post row counts are the same, do this.
If the pre and post row counts are not the same, do that.
This MAY help if you have a column which has no bad data. Add a second Flat File Source to the package, use the same connection as your existing file source, choose the first column only, and direct the output to a Row Count.

Does every record have a unique field in SQL Server?

I'm working in Visual Studio - VB.NET.
My problem is that I want to delete a specific row in SQL Server but the only unique column I have is an Identity that increments automatically.
My process of work:
1. I add a row to the table (the identity is incremented, but I don't know the number)
2. I want to delete the previous row
Is there a sort of unique ID that every new record has?
It's possible that my table has two records that are exactly the same, with only the sequence (identity) differing.
Any ideas how to handle this problem?
SQL Server has a few functions that return the generated ID of the last inserted row(s), each with its own specific strengths and weaknesses.
Basically:
@@IDENTITY works if you do not use triggers.
SCOPE_IDENTITY() works for the code you explicitly called.
IDENT_CURRENT('tablename') works for a specific table, across all scopes.
In almost all scenarios SCOPE_IDENTITY() is what you need, and it's a good habit to use it as opposed to the other options.
A good discussion on the pros and cons of the approaches is also available here.
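A minimal sketch of the pattern (the table and column names are made up for illustration): capture SCOPE_IDENTITY() right after the insert, then delete by that key instead of looking for a "previous" row:
DECLARE @NewId INT;

-- Insert the new row and remember its generated key
INSERT INTO dbo.Customers (FirstName, LastName) VALUES ('John', 'Smith');
SET @NewId = SCOPE_IDENTITY();

-- Later, delete exactly that row by its key
DELETE FROM dbo.Customers WHERE CustomerId = @NewId;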
I want to delete the previous row
And that is your problem. There is no such concept in SQL as a 'previous row'. The word previous implies order, and order applies only to queries, where it is achieved by adding an ORDER BY clause. Tables have no order. You need to rephrase this in terms of "I need to delete the record that satisfies <this> condition." This may sound to you like pedantic gibberish, but you will never find a solution until you acknowledge the problem.
Trying to take the value of the inserted identity column and subtract 1 from it is flawed in many, many ways. It is incorrect under concurrency. It is incorrect in the presence of rollbacks. It is incorrect after ETL jobs. Overall, never expect monotonically increasing identities; they are free to jump gaps, and your code should be correct in the presence of gaps.

Updating the same row in oracle during a trigger?

Short question since I don't know how to search for this. Can I "re-update" the same row? For example, I have a field that stores the sub-total of a payment, and given my business constraints I can update that value. Can I update the total of the same row with just a trigger? Thank you beforehand!
By the way I'm using Oracle and PL/SQL.
Business rules: there's a table that stores "will pay" data, another table that stores the monthly fees to be paid, and another one that stores the possible discounts. One "will pay" can only be discounted once, and the "will pay" stores the subtotal and the total. So what I'm doing is: when the discount information is updated, after it's committed, update the total value and the values of the fees.
You can't update the table your trigger is firing against, you'll get an ORA-04091 mutating table error. You can update values in the row itself, using the :NEW syntax, as long as it's a 'before' trigger.
I'm unclear what you mean about the subtotal, though; it sounds like you have a running-total field on the table. If that's based on other records in the same table (e.g. you have multiple records for the same order, and you want an inserted record to hold the sum of all the previous ones), then you can't do that either, as you'd hit the same ORA-04091.
If you're updating a row then you could adjust a field, for example by setting :NEW.subtotal := :OLD.subtotal - :OLD.value + :NEW.value, though I'm not sure what the benefit of that field would be.
It would be helpful to see what your business logic is and how it fits in with the insert/update, and what you want the trigger to do. Often with something like this you really want to be using a wrapper procedure around the insert/update, rather than a trigger, but it's a bit vague at present.
For the subtotal to remain accurate, I'd probably avoid trying to maintain it all, and instead use a view which has an analytic function calculating it for you. Much less hassle, in my experience.
Yes - a BEFORE INSERT for each row trigger can modify the values being inserted.
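To make the :NEW assignment idea concrete, here is a minimal sketch of such a trigger; the table and column names (payments, fee_value, subtotal) are assumptions, not from the question:
CREATE OR REPLACE TRIGGER trg_payments_subtotal
BEFORE INSERT OR UPDATE ON payments
FOR EACH ROW
BEGIN
  -- Only the :NEW/:OLD values of the row being modified are touched,
  -- so this does not raise the ORA-04091 mutating table error.
  :NEW.subtotal := NVL(:OLD.subtotal, 0) - NVL(:OLD.fee_value, 0) + :NEW.fee_value;
END;
/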

How should I represent a unique super-admin privilege in a database?

My project needs to have a number of administrators, out of which only one will have super-admin privileges.
What is the best way to represent this in the database?
There are a few ways to do this.
Number 1: Have a column on your administrator (or user) table called IsSuperAdmin and have an insert/update trigger to ensure that only one has it set at any given time.
Number 2: Have a TimestampWhenMadeSuperAdmin column in your table. Then, in your query to figure out who it is, use something like:
select user_id from users
where TimestampWhenMadeSuperAdmin is not null
order by TimestampWhenMadeSuperAdmin desc
fetch first 1 row only;
Number 3/4: Put the SuperAdmin user ID into a separate table, using either the trigger or last-person-made-has-the-power approach from numbers 1 or 2.
Personally, I like number 2 since it gives you what you need without unnecessary triggers, and there's an audit trail as to who had the power at any given time (though not a complete audit trail since it will only store the most recent time that someone was made a SuperAdmin).
The trouble with number 1 is what to do if you just clear the current SuperAdmin: either you have to give the power to someone else, or nobody has it. In other words, you can get yourself into a situation where there is no SuperAdmin. And numbers 3 and 4 just complicate things with an extra table.
Use a roles/groups approach. You have a table containing all the possible roles, and then an intersect table containing the key of the user and the key of the role they belong to (there can be multiple entries per user, as each user could have several roles or belong to several groups).
Also, don't call them super admin - just admin is fine, call the rest power user or something similar.
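A minimal sketch of that schema (table and column names are illustrative; an existing users table with a user_id key is assumed):
CREATE TABLE roles (
    role_id   INT PRIMARY KEY,
    role_name VARCHAR(50) NOT NULL UNIQUE    -- e.g. 'admin', 'power user'
);

CREATE TABLE user_roles (
    user_id INT NOT NULL REFERENCES users(user_id),
    role_id INT NOT NULL REFERENCES roles(role_id),
    PRIMARY KEY (user_id, role_id)           -- each user can hold several roles
);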
Simple, yet effective: UserId = 1. Your application will always know it is the SuperUser.
