remove redundancy in a file based on two fields, using awk - arrays

I'm trying to remove duplicate lines in a very large file (~100,000 records) according to the values of the first two columns without taking into account their order, and then print those fields + the other columns.
So, from this input:
A B XX XX
A C XX XX
B A XX XX
B D XX XX
B E XX XX
C A XX XX
I'd like to have:
A B XX XX
A C XX XX
B D XX XX
B E XX XX
(That is, I want to remove 'B A' and 'C A' because they already appear in the opposite order; I don't care about what's in the next columns but I want to print it too)
I have the impression that this should be easy to do with awk + arrays, but I can't come up with a solution.
So far, I'm tinkering with this:
awk '
NR == FNR {
h[$1] = $2
next
}
$1 in h {
print h[$1],$2}' input.txt
I'm storing the second column in an array indexed by the first (h), and then check if there are occurrences of the first field in the stored array. Then, print the line. But something's wrong and I have no output.
I'm sorry because my code is not helpful at all but I'm kind of stuck with this.
Do you have any ideas?
Thanks a lot!

Just keep track of the pairs that have appeared, in both orders:
$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
A B XX XX
A C XX XX
B D XX XX
B E XX XX
Which is equivalent to awk '!(seen[$1,$2]++ || seen[$2,$1]++)' file.
Note it is also equivalent to omitting the ++ in the second expression (see comments):
awk '!seen[$1,$2]++ && !seen[$2,$1]' file
Explanation
The typical approach to print unique lines is:
awk '!seen[$0]++' file
This creates an array seen[] whose indexes are the lines that have appeared so far. When a line is new, seen[$0] is 0, so !seen[$0] is true and the line is printed (in awk, a true pattern with no action prints the current line); the post-increment then sets seen[$0] to 1. When the line has been seen already, seen[$0] has a positive value, so !seen[$0] is false and nothing is printed.
In your case you want to keep track of what appeared, no matter the order, so I store the index in both possible orders: seen[$1,$2] and seen[$2,$1].
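For readers who want to see the same idea outside awk, here is a minimal Python sketch. It is not the answer's code: the function name dedupe_unordered and the choice of a frozenset as an order-insensitive key are assumptions made for illustration. One difference worth noting: a line whose first two fields are identical (for example "A A") is kept by this sketch, whereas the one-liner above would, as far as I can tell, never print such a line, because the first test already increments the same array element that the second test reads.

```python
def dedupe_unordered(lines):
    """Yield lines whose first two fields, ignoring their order, are new."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: lines without two fields are skipped
        key = frozenset(fields[:2])  # {A, B} equals {B, A}
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    "A B XX XX",
    "A C XX XX",
    "B A XX XX",
    "B D XX XX",
    "B E XX XX",
    "C A XX XX",
]

for kept in dedupe_unordered(sample):
    print(kept)
```

On the sample input from the question this prints the four lines A B, A C, B D and B E, in input order.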

Use it as below:
$ awk '{ if (!(($1,$2) in a) && !(($2,$1) in a)) a[$1,$2] = $0 } END { for (i in a) print a[i] }' input.txt
Explanation:
The command stores each record in array a, keyed by the combination of the first and second fields, but only if neither combination ($1,$2 or $2,$1) is already present in the array. Using ($1,$2) rather than the plain concatenation $1$2 avoids treating "AB" "C" and "A" "BC" as the same key. Once the complete file has been read, the array is printed. Note that for (i in a) visits the keys in an unspecified order, so the output is not guaranteed to follow the input order.
# (($1,$2) in a) => checks whether there is a key $1,$2 in array a
# it evaluates to 0 if the key is not present
# if neither combination ($1,$2 nor $2,$1) is present, store the record in array a
if (!(($1,$2) in a) && !(($2,$1) in a)) a[$1,$2] = $0
# at the end, print array a (which holds the complete unique records)
END { for (i in a) print a[i] }

Related

How to access even elements of array in bash

I want to echo the even elements of an array in bash, how could this be achieved?
Assuming your array is not sparse (contains no gaps), and assuming that by "even" you mean counting from 1 (and not from 0 like bash does), you can do that with a loop over the indexes:
array=(a b c d e f g h)
for index in "${!array[@]}"; do
(( index % 2 )) && echo "${array[index]}"
done
outputs:
b
d
f
h
Assuming you're talking about an indexed rather than associative array and you want the values for the even numbered indices rather than the even number values - loop from zero to array size incrementing the index by 2 on each iteration.
Borrowing @Camunsensei's example:
array=(a b c d e f g h)
for (( index=0; index<${#array[@]}; index+=2 )); do
printf 'array[%d]=%q\n' "$index" "${array[index]}"
done
array[0]=a
array[2]=c
array[4]=e
array[6]=g
If that's not what you need then editing your question to include some sample input, expected output, and what you've tried so far would help a lot.

How to store the multiple positions of a character in a string inside an array in free format RPGLE?

In standard RPGLE, my code looks like this. This statement stores the positions of the commas in Data in ComArr array.
C ',' Scan Data ComArr
I tried doing it in free format like this. But every element of the ComArr array is loaded with the position of the first comma in Data. This is because %Scan returns only one position, and assigning that to an array loads the whole array with that single value.
ComArr = %Scan(',':Data) ;
Is there any other method to process SCAN in free format RPGLE like it does in C spec? Basically I want to split the string separated by a delimiter.
One possibility is to keep the C-spec as-is. If the code block needs an array of delimiter positions, and one line of code already does that, put a comment above the fixed-format spec describing what it does and leave it in there.
If /free is required and you don't want to replace the entire block of code, you will need to roll your own loop to build the array of delimiters.
I don't personally convert from fixed to /free unless I am re-writing the block of code to be functionally different. That is, I would almost certainly write a different algorithm in /free than I would have written in fixed. So the entire process of building an array of delimiter positions and then splitting the string based on that array is not something I would do in /free.
I would write a new sub-procedure that returns an array of strings given one delimited input string. The code inside that sub-procedure would make one pass through the input, looking for delimiters with %scan(), and for each one found, split the substring into the next available output array element. There's no need for an array of delimiter positions with this sort of algorithm.
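The single-pass idea described above can be sketched as follows. This is Python rather than RPGLE, and the name split_by_scan is made up for illustration; str.find with a start offset stands in for calling %scan() with a start position. It is a rough outline of the algorithm, not a drop-in replacement for the C-spec.

```python
from typing import List


def split_by_scan(text: str, delim: str = ",") -> List[str]:
    """Split text in one pass, scanning forward for each delimiter."""
    if not delim:
        raise ValueError("delimiter must not be empty")
    parts = []
    start = 0
    while True:
        pos = text.find(delim, start)  # next delimiter at or after start
        if pos == -1:
            parts.append(text[start:])  # last piece: no more delimiters
            break
        parts.append(text[start:pos])  # piece before the delimiter
        start = pos + len(delim)       # continue just past the delimiter
    return parts


print(split_by_scan("alpha,beta,,gamma"))
```

No array of delimiter positions is built; each substring goes straight into the next output element, and an empty field between two adjacent delimiters is kept as an empty string.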
This is probably a little late, but if anyone else needs to split a string by a given delimiter, this code should do what you need.
If you assign a value to an array using wildcard eval array(*) = ..., it applies to every element of the array.
Declare the prototype in your source:
D split pr 1024a varying
D string 65535a varying const options(*varsize)
D delims 50a varying const
D pos 10i 0
Declare a couple of variables.
This assumes your input string is 1000 characters and each separated item is 10 characters maximum:
D idx s 10i 0
D list s 1000a
D splitAry s 10a dim(100)
This is how you split the string.
This tells the routine your delimiter is a comma:
c eval idx = 0
c eval splitAry(*) = split(list:',':idx)
Define the procedure that does the work:
*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
* split - Split delimited string
*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Psplit b export
D split pi 1024a varying
D iString 65535a varying const options(*varsize)
D iDelims 50a varying const
D iPos 10i 0
*
D result s 1024a varying
D start s 10i 0
D char s 1a
*-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
c eval start = iPos + 1
c eval %len(result) = 0
*
c for iPos = start to %len(iString)
c eval char = %subst(iString:iPos:1)
c if %check(iDelims:char) = 1
c eval result = result + char
c else
c leave
c endif
c endfor
*
c return result
Psplit e
Don't forget to add dftactgrp(*no) to your H spec if you're defining and using this in the same module!

How to remove redundant (similar items) from a 2 dimensional array in perl

I am rather new to perl, but so far I have found it a very strong language.
Every month I pull an extract from a license register for a product that I am managing, and the data is in CSV format.
I have managed to complete the code to get a sorted list, and also sorted as per my requirements. The list is some 1200 rows.
The format of the list looks like this (I have kept only the vital parts):
Customer;CustomerID;ProductLine;Platform;Version
operatorx;1234;XX;Linux;15
operatorx;1234;YY;x86;7
operatorx;1234;ZZ;Sparc;7
operatory;2345;YY;x86;8
operatory;2345;YY;Sparc;7.1
operatory;2345;ZZ;x86;7.2
The output wanted is like this for the above:
Customer;CustomerID;ProductLine;Platform;Version
operatorx;1234;XX;Linux;15
operatory;2345;YY;x86;8
operatory;2345;ZZ;x86;7.2
My list in the code does not contain any ';'; the values are stored in an array like this:
@sortedlist = ([Customer,customerID,ProductLine,Platform,Version])
So any customer can have many rows in my original list, but if the product is XX, then only the first occurrence in the list should be kept, and no occurrences of product YY or ZZ can be kept.
If a customer has no product XX, then the first occurrence of product YY and first occurrence of product ZZ should be kept.
The list is sorted so that the "best" entry is always the first per customerID.
I have tried a very simple code, checking that current customerID != prevCustomerID then push the row to a new list, but this makes me miss out when a customer has both products YY and ZZ...
I have also tried nesting a lot of if statements, to try to keep track of current row and previous row... but the code grew a lot, and still didn't give me the expected result :-(
I am starting to think that I approach this from the wrong angle, and I have tried to dig into hashes, but since a customer can actually have one or two entries in the final list, I think a hash is disqualified, as the key value here has to be customerID, and in a hash, there should only be one occurrence per customerID.
Does anyone have any idea on how to attack this problem?
Starting from the top, push the very first element to a new list, and then for each consecutive row, check if it exists in the new list, and what product the new list contains, and if product == XX, then scrap the rest for the same customerID, or if product in the new list == YY, scrap the rest until it finds product == ZZ for the same customerID. Then repeat the same, until it finds a new customerID?
--- updated ---
I managed to solve my issue using awk instead.
./myperlscript.pl input.csv | awk -F ';' '!array[$1,$2,$3]++'| awk -F ';' '{ {if ($2 != prev) {print $0; prev = $2; prevprod = $3}} {if ($2 = prev && prevprod != "XX") { prev =$2}}}' > output.csv
But if anyone knows how to achieve the same with standard perl, it would be very nice.
Here is a naive implementation in perl using state variables that achieves the same result. If you actually have more than 3 variables to consider (XX, YY, ZZ here), you could generalize this into a state array and a function that updates the array and decides what to do based on the state of the array.
filter.pl
#!/usr/bin/env perl
use warnings;
use strict;
my $last_customer = '';
my ($seen_xx, $seen_yy, $seen_zz);
while (my $line = <>) {
    # Header
    if ($. == 1) {
        print $line;
        next;
    }
    # Data
    my ($customer_name, $customer_id, $product_line, $platform, $version) = split /;/, $line;
    die "Unable to parse line : $line"
        unless defined $customer_name;
    if ($customer_name ne $last_customer) {
        $last_customer = $customer_name;
        ($seen_xx, $seen_yy, $seen_zz) = (0,0,0); # Reset
    }
    if (not $seen_xx and $product_line eq 'XX') {
        # Print first XX
        print $line;
        ($seen_xx, $seen_yy, $seen_zz) = (1,1,1); # Ignore the others
    }
    if (not $seen_yy and $product_line eq 'YY') {
        # Print first YY if no XX
        print $line;
        $seen_yy = 1;
    }
    if (not $seen_zz and $product_line eq 'ZZ') {
        # Print first ZZ if no XX
        print $line;
        $seen_zz = 1;
    }
}
Output
cat input | perl filter.pl
Customer;CustomerID;ProductLine;Platform;Version
operatorx;1234;XX;Linux;15
operatory;2345;YY;x86;8
operatory;2345;ZZ;x86;7.2
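The generalization hinted at above (tracking state per customer instead of one flag per product) could look roughly like the following. It is a Python sketch, not Perl, and the names filter_rows and exclusive are made up for illustration. It keeps a set of already-kept product lines per customer ID, which also shows that a hash keyed by customer ID can hold more than one entry per customer. It assumes, as the question states, that the "best" row comes first for each customer.

```python
def filter_rows(rows, exclusive="XX"):
    """Keep the first row per (customer, product); stop once `exclusive` is kept."""
    kept = []
    seen = {}  # customer_id -> set of product lines already kept
    for row in rows:
        customer_id, product = row[1], row[2]
        products = seen.setdefault(customer_id, set())
        if exclusive in products or product in products:
            continue  # customer already has XX, or this product was kept before
        products.add(product)
        kept.append(row)
    return kept


rows = [
    ["operatorx", "1234", "XX", "Linux", "15"],
    ["operatorx", "1234", "YY", "x86", "7"],
    ["operatorx", "1234", "ZZ", "Sparc", "7"],
    ["operatory", "2345", "YY", "x86", "8"],
    ["operatory", "2345", "YY", "Sparc", "7.1"],
    ["operatory", "2345", "ZZ", "x86", "7.2"],
]

for row in filter_rows(rows):
    print(";".join(row))
```

Adding a fourth or fifth product line needs no new variables here, since the per-customer set grows as needed.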

Search pattern and print hits lower than threshold

Here is an example of what I need:
INPUT:
a 5
a 7
a 11
b 10
b 11
b 12
.
.
.
OUTPUT:
a 2
b 0
So the output should be the number of hits below my threshold (in this case, $2 < 10).
My code is:
awk 'OFS="\t" {v[$1]+=$2; n[$1]++} END {for (l in n) {print l, n[l]} }' input
and my output is
a 3
b 3
I am not sure where to put condition $2 < 10.
You can check the threshold condition with something like $2 < value, where value is an awk variable given with -v value=XX.
Also, you are using v[$1]+=$2: this sums the values rather than counting the matching cases.
All together, I would use this:
awk -v t=10 '{list[$1]} $2<t {count[$1]++} END {for (i in list) print i, count[i]+0}' file
Note we need to use two arrays: one to keep track of the counters and another one to keep track of all possible values.
Explanation
-v t=10 provide threshold.
{list[$1]} keep track of all possible first fields appearing.
$2<t {count[$1]++} if the 2nd field is smaller than the threshold, increment the counter.
END {for (i in list) print i, count[i]+0} finally, loop through all the first fields and print the number of times they had a value lower than the threshold. The count[i]+0 trick makes it print 0 if the value is not set.
Test
$ awk -v t=10 '{list[$1]} $2<t {count[$1]++} END {for (i in list) print i, count[i]+0}' a
a 2
b 0
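The same two-step logic (register every key, then count only the values under the threshold) can be sketched in Python. The function name count_below is an assumption for illustration; a single dict plays the role of both awk arrays, because every key is registered with a count of 0 before the comparison.

```python
def count_below(lines, threshold):
    """Count, per first field, how many second fields are below threshold."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: skip lines without two fields
        key, value = fields[0], float(fields[1])
        counts.setdefault(key, 0)  # register the key even if nothing matches
        if value < threshold:
            counts[key] += 1
    return counts


data = ["a 5", "a 7", "a 11", "b 10", "b 11", "b 12"]
for key, hits in count_below(data, 10).items():
    print(key, hits)
```

For the sample data this reports 2 for a and 0 for b, matching the awk output above.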

Algorithm for maintaining an "ordering string" for ordering database elements

I have a database in which I'd like to store an arbitrary ordering for a particular element. The database in question doesn't support ordered sets, so I have to do this myself.
One way to do this would be to store a float value for the element's position, and then take the average of the position of the surrounding elements when inserting a new one:
Item A - Position 1
Item B - Position 1.5 (just inserted).
Item C - Position 2
Now, for various reasons I don't wish to use floats, I'd like to use strings instead. For example:
Item A - Position a
Item B - Position aa (just inserted).
Item C - Position b
I'd like to keep these strings as short as possible since they will never be "tidied up".
Can anyone suggest an algorithm for generating such strings as efficiently and compactly as possible?
Thanks,
Tim
It would be reasonable to assign position 'am' or 'an' to Item B and use binary division steps for further insertions.
This resembles a base-26 number system, where the symbols 'a'..'z' correspond to 0..25.
a b //0 1
a an b //insert after a - middle letter of alphabet
a an au b //insert after an
a an ar au b //insert after an again (middle of an, au)
a an ap ar au b //insert after an again
a an ao ap ar au b //insert after an again
a an ann ao... //insert after an; there is no more room after an, so we have to use a 3-symbol label
....
a an anb... //to insert after an, we treat it as ana
a an anan anb // it looks like 0 0.5 0.505 0.51
Pseudocode for binary tree structure:
function InsertAndGetStringKey(Root, Element): string
if Root = nil then
return Middle('a', 'z') //'n'
if Element > Root then
if Root.Right = nil then
return Middle(Root.StringKey, 'z')
else
return InsertAndGetStringKey(Root.Right, Element)
if Element < Root then
if Root.Left = nil then
return Middle(Root.StringKey, 'a')
else
return InsertAndGetStringKey(Root.Left, Element)
Middle(x, y):
//equalize length of strings like (an, anf)-> (ana, anf)
L = Length(x) - Length(y)
if L < 0 then
x = x + StringOf('a', -L) //x + 'aaaaa...' -L times
else if L > 0 then
y = y + StringOf('a', L)
if Abs(LastSymbol(x) - LastSymbol(y)) = 1 then
return(Min(x, y) + 'n') // (anf, ang) - > anfn
else
return(Min(x, y) + (LastSymbol(x) + LastSymbol(y))/2) // (nf, ni)-> ng
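As a runnable illustration of the "middle string" idea, here is a Python sketch. It is not a translation of the pseudocode above but a related midpoint technique that treats keys as base-26 fractions; the names midpoint, DIGITS and ZERO are made up for illustration. It assumes keys use only 'a'..'z' and never end in 'a' (the zero digit), since otherwise two keys such as 'a' and 'aa' can end up with no string strictly between them. An empty lower bound means "before everything" and an upper bound of None means "after everything".

```python
import string
from typing import Optional

DIGITS = string.ascii_lowercase  # 'a' plays the role of digit 0
ZERO = DIGITS[0]


def midpoint(lo: str, hi: Optional[str]) -> str:
    """Return a key that sorts strictly between lo and hi."""
    if lo.endswith(ZERO) or (hi is not None and hi.endswith(ZERO)):
        raise ValueError("keys must not end in the zero digit")
    if hi is not None and lo >= hi:
        raise ValueError("lo must sort before hi")
    if hi is not None:
        # Skip the common prefix, padding lo with zero digits if it is shorter.
        n = 0
        while (lo[n] if n < len(lo) else ZERO) == hi[n]:
            n += 1
        if n > 0:
            return hi[:n] + midpoint(lo[n:], hi[n:])
    # From here on the first digits differ (or one bound is open).
    d_lo = DIGITS.index(lo[0]) if lo else 0
    d_hi = DIGITS.index(hi[0]) if hi is not None else len(DIGITS)
    if d_hi - d_lo > 1:
        return DIGITS[(d_lo + d_hi) // 2]  # room for a single middle digit
    if hi is not None and len(hi) > 1:
        return hi[:1]  # the first digit of hi alone already fits
    # Consecutive digits: keep lo's digit and extend with one more symbol.
    return DIGITS[d_lo] + midpoint(lo[1:], None)


print(midpoint("", None))   # first key in an empty list
print(midpoint("n", "o"))   # between two adjacent single letters
```

Keys only grow by one symbol when the gap between neighbors is exhausted, which keeps them reasonably short for insertions spread across the list; repeated insertion at the same spot still lengthens them.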
As stated the problem has no solution. Once an algorithm has generated strings 'a' and 'aa' for adjacent elements there is no string which can be inserted between them. This is a fatal problem for the approach. This problem is independent of the alphabet used for the strings: replace 'a' by 'the first letter in the alphabet used' if you wish.
Of course, it can be worked around by changing the ordering string for other elements when this impasse is reached, but that seems to be beyond what OP wants.
I think that the problem is equivalent to finding an integer to represent the order of an element and finding that, say, 35 and 36 are already used to order existing elements. There is simply no integer between 35 and 36, no matter how hard you look.
Use real numbers, or a computer approximation such as floating-point numbers, or rationals.
EDIT in response to OP's comment
Just adapt the algorithm for adding 2 rationals: (a/b)+(c/d) = (ad+cb)/bd. Halve that sum, i.e. take (ad+cb)/(2bd) (rounding if you want or need), and you have a rational midway between the first two.
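A small Python sketch of the rational midpoint, using the standard-library fractions module. The helper name rational_midpoint is an assumption for illustration; how the numerator and denominator would be stored in the database is left open here.

```python
from fractions import Fraction


def rational_midpoint(lo: Fraction, hi: Fraction) -> Fraction:
    """Return the rational exactly halfway between lo and hi."""
    return (lo + hi) / 2  # (a/b + c/d) / 2 == (ad + cb) / (2bd)


print(rational_midpoint(Fraction(35), Fraction(36)))      # 71/2
print(rational_midpoint(Fraction(1, 3), Fraction(1, 2)))  # 5/12
```

Unlike integers, there is always another rational between two distinct positions, at the cost of numerators and denominators that grow with repeated insertion.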
Are capitals an option?
If so, I would use them to insert between otherwise adjacent values.
For instance to insert between
a
aa
You could do:
a
aAaa <--- this capital indicates that there is one more position between adjacent lowercase values, i.e. a[Aa]a
aAba
aAca
aBaa
aBba
aa
Now if you need to insert between a and aAaa
You could do
a
aAAaaa <--- two capitals indicate that there are two more positions between adjacent lowercase values, i.e. a[AAaa]a
aAAaba
aAAaca
...
aAAbaa
aAaa
In terms of being compact or efficient I make no claims...
