How to remove first joined row Talend DI - file

How to delete the first matching row in a file using a second one?
I use Talend DI 7.2 and I need to delete some rows from one delimited file using a second file that lists the rows to delete. The first file contains multiple rows matching the second one, but for each row in the second file I need to delete only the first matching row in the first file.
For example:
File A :
Code | Amount
1 | 45
1 | 45
2 | 50
2 | 60
3 | 70
3 | 70
3 | 70
3 | 70
File B :
Code | Amount
1 | 45
3 | 70
3 | 70
At the end, I need to obtain :
File A :
Code | Amount
1 | 45
2 | 50
2 | 60
3 | 70
3 | 70
For each row in file B, only the first match in file A is removed.
I tried with tMap and tFilterRow, but they match all rows, not only the first one.
Example edited: the same code-amount pair can appear several times in file B, and I need to remove that same number of rows from file A.

You can do this by using variables within the tMap. I created three:
v_match - returns "match" if the code and amount are in lookup file B.
v_count - incremented if it's a repeating value; otherwise reset to 0.
v_last_row - stores the value of v_match before the next comparison, so that the current row can be compared to the last row to get counts.
Then add an expression filter to remove any first match.
This will give the desired results.

You can't delete rows from a file, so you'll have to generate a new file containing only the rows you want.
Here's a simple solution.
First, join your files using a left join between A as a main flow, and B as a lookup.
In the tMap, using an output filter, you only write to the output file the rows from A that don't match anything in B (row2.code == null) or those which have a match, but not a first match.
The trick is to use a Numeric.sequence, with the code as the id of the sequence; if the sequence returns a value other than 1, you know you've already seen that code on a previous line. If it's the first occurrence of the code, the sequence starts at 1 and returns 1, so the row is filtered out.
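Outside Talend, the requirement from the edited question (remove from file A as many rows as file B contains for each code-amount pair) can be sketched in a few lines of Python. This is only an illustration of the logic, not Talend code: the function name remove_first_matches and the in-memory lists are assumptions made for the example.

```python
from collections import Counter

def remove_first_matches(rows_a, rows_b):
    """Return rows_a without the first matching row for each row of rows_b."""
    # Number of rows still to remove for each (code, amount) pair.
    to_remove = Counter(rows_b)
    kept = []
    for row in rows_a:
        if to_remove[row] > 0:
            to_remove[row] -= 1  # consume one pending removal, drop the row
        else:
            kept.append(row)     # no removal pending for this pair, keep it
    return kept

# Sample data from the question, as (code, amount) tuples.
file_a = [(1, 45), (1, 45), (2, 50), (2, 60), (3, 70), (3, 70), (3, 70), (3, 70)]
file_b = [(1, 45), (3, 70), (3, 70)]

result = remove_first_matches(file_a, file_b)
print(result)
```

Keeping a per-pair counter plays the same role as a sequence keyed on the pair: it records how many matches have already been consumed.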


Equivalent of Excel Pivoting in Stata

I have been working with country-level survey data in Stata that I needed to reshape. I ended up exporting the .dta to a .csv and making a pivot table in Excel, but I am curious to know how to do this in Stata, as I couldn't figure it out.
Suppose we have the following data:
country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
I would like the data to be reformatted as such:
country sum_1 sum_2
A 4 4
B 3 2
First I tried a simple reshape wide command but got the error that "values of variable response not unique within country" before realizing reshape without additional steps wouldn't work anyway.
Then I tried generating new variables conditional on the value of response and trying to use reshape following that... the whole thing turned into kind of a mess so I just used Excel.
Just curious if there is a more intuitive way of doing that transformation.
If you just want a table, then just ask for one:
clear
input str1 country response
A 1
A 1
A 2
A 2
A 1
B 1
B 2
B 2
B 1
B 1
A 2
A 2
A 1
end
tabulate country response
           |       response
   country |         1          2 |     Total
-----------+----------------------+----------
         A |         4          4 |         8
         B |         3          2 |         5
-----------+----------------------+----------
     Total |         7          6 |        13
If you want the data to be changed to this, reshape is part of the answer, but you should contract first. collapse is in several ways more versatile, but your "sum" is really a count or frequency, so contract is more direct.
contract country response, freq(sum_)
reshape wide sum_, i(country) j(response)
list
     +-------------------------+
     | country   sum_1   sum_2 |
     |-------------------------|
  1. |       A       4       4 |
  2. |       B       3       2 |
     +-------------------------+
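For readers who want to see the two steps spelled out, here is a minimal Python sketch of the same count-then-pivot idea: first one frequency per (country, response) pair, then one row per country with a column per response. The function name count_and_pivot is hypothetical, and unlike reshape in Stata this sketch fills a combination that never occurs with 0 rather than a missing value.

```python
from collections import Counter

def count_and_pivot(rows):
    """Count (country, response) pairs, then spread the counts into columns."""
    counts = Counter(rows)                        # step 1: one frequency per pair
    countries = sorted({c for c, _ in rows})
    responses = sorted({r for _, r in rows})
    # step 2: one dict per country, one sum_<response> key per response value
    return [
        {"country": c, **{f"sum_{r}": counts[(c, r)] for r in responses}}
        for c in countries
    ]

# Sample data from the question.
data = [("A", 1), ("A", 1), ("A", 2), ("A", 2), ("A", 1),
        ("B", 1), ("B", 2), ("B", 2), ("B", 1), ("B", 1),
        ("A", 2), ("A", 2), ("A", 1)]

table = count_and_pivot(data)
for row in table:
    print(row)
```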
From Stata 16 onwards, frames (see help frames) offer a way to work with multiple datasets in the same session.

How can I use the LAG function and return the same value if the subsequent value in a row is duplicated?

I am using the LAG function to move my values one row down.
However, I need to reuse the previous result when the value in the source column is duplicated:
ID | SOURCE | LAG | DESIRED OUTCOME
1 | 4 | - | -
2 | 2 | 4 | 4
3 | 3 | 2 | 2
4 | 3 | 3 | 2
5 | 3 | 3 | 2
6 | 1 | 3 | 3
7 | 4 | 1 | 1
8 | 4 | 4 | 1
As you can see, for instance in ID range 3-5 the source data doesn't change and the desired outcome should be fed from the last row with different value (so in this case ID 2).
SQL Server's version of lag supports an expression in the second argument to determine how many rows back to look. You can use it for a check that decides whether to look back at all, e.g.
select lagged = lag(data,iif(decider < 0,0,1)) over (order by id)
from (values(0,1,'dog')
,(1,2,'horse')
,(2,-1,'donkey')
,(3,2,'chicken')
,(4,23,'cow'))f(id,decider,data)
This returns the following list
null
dog
donkey
donkey
chicken
The third row returns its own value (donkey) because the decider value on the row with an id of 2 is negative, so the offset is 0.
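To make the per-row offset concrete, here is a small Python sketch that mimics lag with a different offset on each row. It is an illustration only: lag_with_offsets is a hypothetical helper, not a SQL Server function, and None stands in for NULL.

```python
def lag_with_offsets(values, offsets):
    """For each row i, return values[i - offsets[i]], or None if that row does not exist."""
    out = []
    for i, offset in enumerate(offsets):
        j = i - offset
        out.append(values[j] if j >= 0 else None)
    return out

data = ["dog", "horse", "donkey", "chicken", "cow"]
decider = [1, 2, -1, 2, 23]

# Same rule as iif(decider < 0, 0, 1): offset 0 keeps the row's own value.
offsets = [0 if d < 0 else 1 for d in decider]

lagged = lag_with_offsets(data, offsets)
print(lagged)
```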
Well, first, lag may not be the tool for the job. This might be easier to solve with a recursive CTE. SQL and window functions work over sets. That said, our goal here is to come up with a way of describing what we want. We'd like a way to partition our data so that sequential islands of the same value are part of the same set.
One way we can do that is by using lag to help us discover whether the previous row was different or not.
From there, we can take a running sum over these change events to create partitions. Once we have partitions, we can assign a row number to each element in the partition. Finally, once we have that, we can use the row number to look back that many elements.
;with d as (
select * from (values
(1,4)
,(2,2)
,(3,3)
,(4,3)
,(5,3)
,(6,1)
,(7,4)
,(8,4)
)f(id,source))
select *,lag(source,rn) over (order by Id)
from (
select *,rn=row_number() over (partition by partition_id order by id)
from (
select *, partition_id = sum(change) over (order by id)
from (
select *,change = iif(lag(source) over (order by id) != source,1,0)
from d
) source_with_change
) partitioned
) row_counted
As an aside, this is an absolutely cruel interview question I was once asked to do.

Excel - VLOOKUP to return each result in an array, not just the first

I am currently working between two workbooks.
In Workbook A I have the following data.
A ... D E F ... N
1.| ID | Name | Desc | Prod | Country|
2.| 12345 | Apple| Fruit| 10| US|
3.| 12346 | Celery| Veg| 150| US|
4.| 12347 | Mint| Herb| 25| FR|
I have been using the following formula from AHC in Workbook B. The aim is to perform a VLOOKUP that grabs all the IDs, but only where Country = "US".
=VLOOKUP("US", CHOOSE({2,1},Workbook A.xlsx!Table1[ID], Workbook A.xlsx!Table1[Country]), 2, FALSE)
This formula works well, however, my problem comes because the formula will only ever return the first instance in the array. For example, if I include this formula in Workbook B, Col A it will look like this:
A
1.|ID of US|
2.| 12345 |
3.| 12345 |
4.| 12345 |
5.| 12345 |
6.| 12345 |
7.| 12345 |
How would I make this formula work so that it returns each ID which matches "US", not just the first occurrence of a match?
In B2 put this formula (you might need to adjust the rows of the ranges; I went to 100):
={IFERROR(INDEX(D$2:D$100,SMALL(IF(N$2:N$100=$A2,ROW(D$2:D$100)-ROW(D$2)+1),ROW(N1))),"")}
NOTE:
Step 1) Insert the formula only in Cell B2 without the {}
Step 2) Once the formula is inserted mark the entire formula and press Ctrl + Shift + Enter so the formula will get the {}
Step 3) Drag it down the rows as far as you need it to get the list.
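The idea behind the array formula is to collect every row that meets the condition and return them one after another instead of stopping at the first match. As a rough illustration outside Excel, the same filtering looks like this in Python; the function name all_matches and the dictionary keys are assumptions based on the column headers in the question.

```python
def all_matches(rows, country):
    """Return the ID of every row whose Country equals the given value, in order."""
    return [row["ID"] for row in rows if row["Country"] == country]

# Sample rows from Workbook A in the question.
workbook_a = [
    {"ID": 12345, "Name": "Apple", "Desc": "Fruit", "Prod": 10, "Country": "US"},
    {"ID": 12346, "Name": "Celery", "Desc": "Veg", "Prod": 150, "Country": "US"},
    {"ID": 12347, "Name": "Mint", "Desc": "Herb", "Prod": 25, "Country": "FR"},
]

us_ids = all_matches(workbook_a, "US")
print(us_ids)
```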

Array sorting in matlab

Hi, I have a 289x2 array that I want to sort in MATLAB. I want to sort the first column into ascending numerical order, but keep the second-column entry associated with each value. The best way to explain is through an example.
x = 76 1
36 2
45 3
Now I want to sort x so that it returns an array that looks like:
x = 36 2
45 3
76 1
So the first column has been sorted into numerical order but each row has retained its second-column value. So far I have tried sort(x,1). This sorts the first column as I want but does not keep the pairing. It returns x as:
x = 36 1
45 2
76 3
Any help would be great. Cheers!!
This is exactly what sortrows does.
x=sortrows(x); % or x=sortrows(x,1);
or if you want to use sort then get the sorted indexes first and then arrange the rows accordingly like this:
[~, idx] = sort(x); %Finding the sorted indexes
x = x(idx(:,1),:) ; %Arranging according to the indexes of the first column
Output for both approaches:
x =
36 2
45 3
76 1
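For comparison, the second approach (get the sort order of the first column, then reorder whole rows with it) can be sketched in Python. This is an illustration of the idea rather than MATLAB code; the variable names are arbitrary.

```python
x = [[76, 1], [36, 2], [45, 3]]

# Indexes that would sort the first column, like the idx output of sort.
idx = sorted(range(len(x)), key=lambda i: x[i][0])

# Reorder whole rows with those indexes so each pair stays together.
x_sorted = [x[i] for i in idx]
print(x_sorted)

# Sorting the rows directly on the first column gives the same result.
x_by_key = sorted(x, key=lambda row: row[0])
```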

SQLite - find a certain value and find the highest value on another column on the row input value appears then select all the elements on that row

I am using the sqlite3 library in a C project, and I have a database like this:
| Time | Message | ID | Temp |
|--------|-----------|------|--------|
|09:05:37| 1514 | 62 | 35 |
|10:14:45| 1515 | 43 | 14 |
|11:18:50| 1997 | 28 | 43 |
|08:04:23| 1998 | 28 | 50 |
Message is an autoincrement value. What I am trying to do and my questions are:
int ID_in;
printf("Enter the ID you want to check: ");
scanf("%d", &ID_in);
First, the user inputs a value and I have to check whether that value appears in the ID column; if it does, I can start my operations. Since the user supplies the value, should I use a sqlite3_bind() function to compare ID_in with the values in the database?
If ID_in appears in the ID column, I have to find the highest value in the Message column among the rows with that ID, since ID_in may appear more than once. For example, if ID_in is 28, I have to access the row whose Message value is 1998.
After I access that row, I should print all of its columns. For example, if ID_in is 28, find the value 28 in the ID column, then find the highest Message value, which is 1998, then print Time, Message, ID, Temp for that row, which are 08:04:23, 1998, 28, 50 in this example.
Yes, you can use a prepared statement and the sqlite3_bind_* functions (for example sqlite3_bind_int) to pass the user input to the query.
To select the row with the highest Message value, you can use:
Select * from [table_name] where id = [user_input] order by message desc limit 1;
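As a quick way to try the query, here is a minimal sketch using Python's built-in sqlite3 module with a bound parameter. The table name readings and the helper latest_row_for_id are assumptions made for the example; in the C API the same flow would go through sqlite3_prepare_v2, sqlite3_bind_int and sqlite3_step.

```python
import sqlite3

def latest_row_for_id(conn, id_in):
    """Return the row with the highest Message for the given ID, or None if the ID is absent."""
    cur = conn.execute(
        "SELECT Time, Message, ID, Temp FROM readings "
        "WHERE ID = ? ORDER BY Message DESC LIMIT 1",
        (id_in,),  # the user input is bound as a parameter, never concatenated
    )
    return cur.fetchone()

# In-memory database filled with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (Time TEXT, Message INTEGER PRIMARY KEY, ID INTEGER, Temp INTEGER)"
)
conn.executemany(
    "INSERT INTO readings (Time, Message, ID, Temp) VALUES (?, ?, ?, ?)",
    [
        ("09:05:37", 1514, 62, 35),
        ("10:14:45", 1515, 43, 14),
        ("11:18:50", 1997, 28, 43),
        ("08:04:23", 1998, 28, 50),
    ],
)

row = latest_row_for_id(conn, 28)
print(row)
```

A result of None means the ID does not appear in the table, which answers the "does it exist" check without a separate query.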
