Deduplication in awk using multiple arrays - works, but shouldn't. Why?

Input data:
name date
G
A 2011-01-21
A
B
C 2011-02-04
D
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
G
H
A
My desired output is a list of distinct values of name and date, discarding rows where name is a duplicate and date is blank:
name date
A 2011-01-21
B
C 2011-02-04
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
H
My awk code below produces this result:
awk 'BEGIN { FS=OFS="\t"}
NR==1 { print }
NR>1 { a[$1]++ }
NR>1 && $2!="" { b[$1]=$2 }
NR>1 && $2=="" { c[$1]=$2 }
END { for (i in a) {
if ( c[i] ) {print i,b[i]}
else {print i,b[i]}
}
}
' test.tsv
However, it shouldn't produce the desired result because in the event c[i] is empty, b[i] should be empty and it should give up. What am I missing here please?

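Your c[i] is useless: both branches of the if/else print the same combination, i and b[i], so the test never changes the output. The script works anyway because a collects every name and b holds a non-blank date for each name that has one, so b[i] is simply empty for the others. You can simplify it a bit, and I think it will get clearer: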
$ awk 'NR==1 {print; next}
{a[$1]=a[$1]==""?$2:a[$1]}
END {for(k in a) print k,a[k]}' file | column -t
name date
A 2011-01-21
B
C 2011-02-04
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
H
This only updates the mapping while the stored value is still blank, so it captures the first non-blank value for each key, if there is one.
Assuming the values are dates (so a non-blank value is never 0 or empty), you can replace the middle block with !a[$1]{a[$1]=$2}
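For readers who want to check the logic outside awk, here is a minimal Python sketch of the same "keep the first non-blank date per name" idea. The function name first_nonblank and the in-memory list of rows are assumptions made for illustration; they are not part of the awk solution.

```python
def first_nonblank(rows):
    """Map each name to its first non-blank date ('' if it never has one)."""
    dates = {}
    for name, date in rows:
        # Only store a value while the current one is still blank,
        # mirroring a[$1] = a[$1]=="" ? $2 : a[$1] in the awk one-liner.
        if dates.get(name, "") == "":
            dates[name] = date
    return dates


rows = [("G", ""), ("A", "2011-01-21"), ("A", ""), ("B", ""),
        ("C", "2011-02-04"), ("D", ""), ("D", "2011-03-26"),
        ("G", "2011-05-10"), ("G", ""), ("H", "")]
result = first_nonblank(rows)
for name in sorted(result):
    print(name, result[name])
```

The explicit sort is there because awk's for (k in a) does not guarantee any particular order either, so sorted output has to be requested in both versions.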


SAS EG How to compare cell values in an array loop?

I am currently trying to compare cell values on the same row over multiple columns, but having issues with referencing the correct cells.
My data currently is this:
col1
col2
col3
col4
col5
col6
a
b
c
d
e
f
a
b
c
d
e
e
a
b
c
d
d
d
I would like to compare col{i} to col{i+1} and drop values when repeated to give:
col1
col2
col3
col4
col5
col6
a
b
c
d
e
f
a
b
c
d
e
-
a
b
c
d
-
-
My current code is:
data want;
set have;
array c{*} col;
do i = 1 to dim(c);
do j = i+1;
if c{i} = c{j} then .;
else c{i};
end;
end;
run;
TIA
data want;
set have;
array c{*} col:;
do i = dim(c) to 2 by -1; *no reason to check #1;
if c{i} = c{i-1} then call missing(c{i}); *if identical to prior, clear out;
end;
run;
You don't need two loops, just one, as you're only checking the element "before" (or "after", but "before" is easier to mentally comprehend, at least for me). Loop from the last element down to 2, check the one prior, and if identical, clear it out.
Importantly, this goes in reverse order (so it handles the d d d row above): if you go left to right, the second d is cleared first, and the last d is then compared against an already-cleared cell, so it survives.
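To see why the direction matters, here is a small Python sketch of the same comparison, run in both directions. The function blank_repeats and the use of None for a cleared cell are assumptions for illustration only; the SAS step is the actual solution.

```python
def blank_repeats(values, reverse=True):
    """Clear every element equal to its left neighbour (None marks a cleared cell)."""
    out = list(values)
    if reverse:
        indices = range(len(out) - 1, 0, -1)   # last element down to the second
    else:
        indices = range(1, len(out))           # second element up to the last
    for i in indices:
        if out[i] is not None and out[i] == out[i - 1]:
            out[i] = None
    return out


print(blank_repeats(list("abcddd")))                 # right to left
print(blank_repeats(list("abcddd"), reverse=False))  # left to right
```

Going right to left, both trailing d values are cleared. Going left to right, the middle d is cleared first, so the last d no longer matches its neighbour and is kept.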
For the case of data containing multiple segments of repeated values and wanting only unique consecutive values you will need to track an insertion index.
Example: Variable j tracks the insertion point
data have;
input (col1-col6) ($) @1 (kol1-kol6) ($);
format col: kol: $1.;
datalines;
a b c d e f
a b c d e e
a b c d d d
a a b b c c
a a b b a a
. b b b c d
a a . . c c
run;
data want(keep=col: kol:);
set have;
array c col1-col6;
j = 1;
do i = 2 to dim(c);
if c(i) ne c(j) then do;
j = j + 1;
if i ne j then do;
c(j) = c(i);
call missing(c(i));
end;
end;
end;
do j = j+1 to i-1;
call missing(c{j});
end;
run;
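The insertion-index idea can be sketched in Python as follows. This is only an illustration under stated assumptions: squeeze_runs is a hypothetical name, None stands in for a SAS missing value, and the index is zero-based rather than one-based.

```python
def squeeze_runs(values):
    """Collapse runs of equal neighbours to the left and pad the tail with None."""
    out = list(values)
    if not out:
        return out
    j = 0  # index of the last element kept so far
    for i in range(1, len(out)):
        if out[i] != out[j]:
            j += 1
            out[j] = out[i]  # move the new value to the insertion point
    # Everything after the last kept element is cleared.
    for k in range(j + 1, len(out)):
        out[k] = None
    return out


print(squeeze_runs(list("aabbaa")))
print(squeeze_runs(list("abcddd")))
```

Note that a value may reappear after a different value (a, b, a), because only consecutive repeats are removed.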
For the case of wanting only unique values of the array, you can use a bubble sorting comparison approach when the number of elements is smallish, say <10.
/* uniqueness via a bubbly search */
data want_b;
set have;
array c col1-col6;
j=0;
do i = 1 to dim(c);
if missing(c{i}) then continue;
do k = 1 to j; * bubble, bubble;
if c{k} = c{i} then do;
call missing(c{i});
leave;
end;
end;
if missing(c{i}) then continue;
j = j + 1;
if j < i then do;
c{j} = c{i};
call missing(c{i});
end;
end;
run;
When the number of elements increases you can use a hash to be more efficient whilst ensuring uniqueness.
/* uniqueness via hash lookup */
data want_h(keep=col: kol:);
set have;
array c col1-col6;
if _n_ = 1 then do;
declare hash v();
length value $20; * must be at least as long as longest of c{*} variable ;
v.defineKey('value');
v.defineData('i');
v.defineDone();
call missing(value);
end;
j = 0;
do i = 1 to dim(c);
if not missing(c{i}) then if v.check(key:c{i}) ne 0 then do;
v.add(key:c{i},data:i);
j = j + 1;
if i ne j then
c(j) = c(i);
end;
end;
do j = j+1 to dim(c);
call missing(c{j});
end;
v.clear();
run;
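As a rough cross-check of the hash approach, the same "first occurrence wins, compact to the left" behaviour can be sketched in Python with a set playing the role of the hash object. The name unique_left and the None placeholder for missing values are assumptions for illustration.

```python
def unique_left(values):
    """Keep the first occurrence of each non-missing value, shifted left; pad with None."""
    seen = set()   # plays the role of the hash object; rebuilt for every row
    kept = []
    for v in values:
        if v is not None and v not in seen:
            seen.add(v)
            kept.append(v)
    return kept + [None] * (len(values) - len(kept))


print(unique_left(list("aabbaa")))
print(unique_left(["a", "a", None, None, "c", "c"]))
```

Creating the set inside the function corresponds to calling v.clear() at the end of each row in the SAS step.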

Snowflake "Invalid UTF8 detected in string SNOWFLAKE" using PUT

I've got a csv file with this content:
"Compañía","Aeropuerto Base","Año","Clase","Grupo Compañía","Mes","Movimiento","País","Servicio","Tipo Avión","Tipo Tráfico","Operaciones Totales"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE SCHENGEN","Total","","","","","","","0"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE NO SCHENGEN","Total","","","","","","","4"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","INTERNACIONAL","Total","","","","","","","2"
I've uploaded it to snowflake using the stage feature:
PUT 'file://C:\\tmp\\opc2020.csv' @demo_stage;
I've created a file format:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',';
If I try to query the content:
SELECT C.$1 FROM @demo_stage (file_format => 'demo_file_format') C
I get an error:
SQL Error [100144] [22000]: Invalid UTF8 detected in string '0xFF0xFE"0x00C0x00o0x00m0x00p0x00a0x000xF10x000xED0x00a0x00"0x00'
File 'opc2020.csv.gz', line 1, character 1
Row 1, column "TRANSIENT_STAGE_TABLE"["$1":1]
If I add the VALIDATE_UTF8 = FALSE attribute, then I can query the stage, but the non-ASCII characters are lost and unexpected whitespace appears between characters:
CREATE OR REPLACE FILE FORMAT dbt_demo_file_format TYPE = 'CSV' field_delimiter = ',' VALIDATE_UTF8 = FALSE;
��" C o m p a � � a " |
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
How can I solve this?
If the original file was generated using a character set other than UTF-8, you will run into this issue.
If you know the character set used to generate the file, you can set the ENCODING parameter in the FILE FORMAT statement to the correct value.
In your case, the error message starts with the bytes 0xFF 0xFE, which is the UTF-16 little-endian byte-order mark, so the file was most likely created as UTF-16LE and your CREATE FILE FORMAT would look like:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',' ENCODING = 'UTF-16LE' VALIDATE_UTF8 = TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '"';
More information on character sets and encoding is in our docs, under the ENCODING option of CREATE FILE FORMAT.
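If you are unsure which encoding a file uses, a byte-order mark at the start of the file is a strong hint. The Python sketch below is only an illustration of that check, plus an optional local re-encode to UTF-8 before uploading; the helper names sniff_bom and to_utf8 are hypothetical, and the sample bytes are constructed in the script rather than read from the real file.

```python
import codecs


def sniff_bom(data: bytes):
    """Return a Python codec name based on a leading byte-order mark, or None."""
    # UTF-32 is tested first: its little-endian BOM begins with the UTF-16 one.
    for bom, name in (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if data.startswith(bom):
            return name
    return None


def to_utf8(data: bytes) -> bytes:
    """Re-encode BOM-marked bytes as UTF-8 (input without a BOM is assumed to be UTF-8)."""
    codec = sniff_bom(data) or "utf-8"
    return data.decode(codec).encode("utf-8")


# Simulate the start of the uploaded file: BOM followed by UTF-16LE text.
sample = codecs.BOM_UTF16_LE + '"Compañía","Año"'.encode("utf-16-le")
print(sniff_bom(sample))
print(to_utf8(sample).decode("utf-8"))
```

The zero byte that UTF-16LE stores after every ASCII character is also what shows up as the "whitespace" between characters when validation is switched off.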

Using nested loops in bash to replace characters in a string with an array of words

I've been picking up Bash lately and I'm having trouble wrapping my head around nested loops.
Here's what I got.
input='ATAATAATAATG'
CODONTABLE=(ATA I ATC I ATT I ATG M ACA T
ACC T ACG T ACT T AAC N AAT N
AAA K AAG K AGC S AGT S AGA R
AGG R CTA L CTC L CTG L CTT L
CCA P CCC P CCG P CCT P CAC H
CAT H CAA Q CAG Q CGA R CGC R
CGG R CGT R GTA V GTC V GTG V
GTT V GCA A GCC A GCG A GCT A
GAC D GAT D GAA E GAG E GGA G
GGC G GGG G GGT G TCA S TCC S
TCG S TCT S TTC F TTT F TTA L
TTG L TAC Y TAT Y TAA _ TAG _
TGC C TGT C TGA _ TGG W)
for ((i=0;i<${#input};i++)) ; do
let w+=1
for c in $input ; do
for h in $CODONTABLE ; do
if [ $(echo ${input:x:3})=$(echo $CODONTABLE[w]) ] ; then
mod+=(${CODONTABLE[w]})
let x+=1
else
let w+=1
fi
done
done
done
echo $mod
echo $input
What I get from this is...
ATAATAATAATG
I
So it seems that at least ATA was properly translated into an I.
However, what I want is
**ATA**ATAATAATG -> I
A**TAA**TAATAATG -> _
AT**AAT**AATAATG -> N
ATA**ATA**ATAATG -> I
So that the final output reads I_NI_NI_NM, which I use later.
In short, how do I create a proper repeating loop that goes through my input, translates every possible 3 character frame, and appends this to another array?
There are actually a lot of problems with your code. Some of them are pure logic errors; others are due to misunderstandings about what various Bash constructs do. (Though I'm guessing that some of the pure logic errors come from trial-and-error attempts to fix problems caused by those misunderstandings.) So as a general suggestion, write and test small pieces to see how they work, and use debugging output (small statements like echo "i=$i w=$w c=$c h=$h") to see what's going on in your code. That will help you build up to a working program.
Below are a few specific problems. They are not a complete list.
This:
for ((i=0;i<${#input};i++)) ; do
let w+=1
...
done
will give w the values 1, 2, 3, … 12. But I think you actually want w to take the values 0, 3, 6, 9? For that, you should write:
for (( w = 0 ; w < ${#input} ; w += 3)) ; do
...
done
(I apologize if I've misunderstood what w is supposed to be. Its name is not very mnemonic, and you seem to use it a few different ways, so it's hard to be sure. Incidentally — I recommend putting some effort into naming your variables better. It makes code so much easier to understand and debug.)
Since $input does not contain any whitespace, this:
for c in $input ; do
...
done
is equivalent to this:
c=$input
...
(Maybe you were expecting for c in $input to loop over the characters of $input? But that's not what that notation does.)
You seem to be trying to treat CODONTABLE as an associative array, but you haven't written it to be one. If you're using a version of Bash that supports associative arrays, then you should use a real one:
declare -A CODONTABLE=([ATA]=I [ATC]=I [ATT]=I [ATG]=M [ACA]=T
[ACC]=T [ACG]=T [ACT]=T [AAC]=N [AAT]=N
[AAA]=K [AAG]=K [AGC]=S [AGT]=S [AGA]=R
[AGG]=R [CTA]=L [CTC]=L [CTG]=L [CTT]=L
[CCA]=P [CCC]=P [CCG]=P [CCT]=P [CAC]=H
[CAT]=H [CAA]=Q [CAG]=Q [CGA]=R [CGC]=R
[CGG]=R [CGT]=R [GTA]=V [GTC]=V [GTG]=V
[GTT]=V [GCA]=A [GCC]=A [GCG]=A [GCT]=A
[GAC]=D [GAT]=D [GAA]=E [GAG]=E [GGA]=G
[GGC]=G [GGG]=G [GGT]=G [TCA]=S [TCC]=S
[TCG]=S [TCT]=S [TTC]=F [TTT]=F [TTA]=L
[TTG]=L [TAC]=Y [TAT]=Y [TAA]=_ [TAG]=_
[TGC]=C [TGT]=C [TGA]=_ [TGG]=W)
If not, then your regular-array approach is fine, but rather than trying to use a deeply-nested loop to find the right mapping in CODONTABLE, you should put that logic in its own function:
function dna_codon_to_amino () {
local dna_codon="$1"
local i
for (( i = 0 ; i < ${#CODONTABLE[@]} ; i += 2 )) ; do
if [[ "$dna_codon" = "${CODONTABLE[i]}" ]] ; then
echo "${CODONTABLE[i+1]}"
return
fi
done
# whoops, didn't find anything. print a warning to standard error,
# return the amino acid '#', and indicate non-success:
echo "Warning: invalid DNA codon: '$dna_codon'" >&2
echo '#'
return 1
}
Then you can call it by writing something like:
amino_codon="$(dna_codon_to_amino "$dna_codon")"
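The relationship between the flat array and the associative array may be easier to see in a small Python sketch. The table here is deliberately truncated, and the names CODON_PAIRS, lookup_linear and CODON_MAP are assumptions for illustration, not part of the Bash code.

```python
# Flat list in the same layout as the original CODONTABLE (truncated for brevity).
CODON_PAIRS = ["ATA", "I", "ATG", "M", "TAA", "_", "AAT", "N"]


def lookup_linear(codon, pairs=CODON_PAIRS):
    """Scan the even positions for the codon; the next position holds the amino acid."""
    for i in range(0, len(pairs), 2):
        if pairs[i] == codon:
            return pairs[i + 1]
    return None  # codon not found


# Pairing even and odd positions once gives a real key/value map.
CODON_MAP = dict(zip(CODON_PAIRS[0::2], CODON_PAIRS[1::2]))

print(lookup_linear("TAA"), CODON_MAP["TAA"])
```

The linear scan steps by 2 exactly like the Bash function, while the map does the same job with a single lookup.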
There's a lot of good in ruakh's answer, but there isn't an explanation of how to step through the string 3 letters at a time, I think. This code does that:
#!/usr/bin/env bash-4.3
declare -A CODONTABLE
CODONTABLE=(
[ATA]=I [ATC]=I [ATT]=I [ATG]=M [ACA]=T
[ACC]=T [ACG]=T [ACT]=T [AAC]=N [AAT]=N
[AAA]=K [AAG]=K [AGC]=S [AGT]=S [AGA]=R
[AGG]=R [CTA]=L [CTC]=L [CTG]=L [CTT]=L
[CCA]=P [CCC]=P [CCG]=P [CCT]=P [CAC]=H
[CAT]=H [CAA]=Q [CAG]=Q [CGA]=R [CGC]=R
[CGG]=R [CGT]=R [GTA]=V [GTC]=V [GTG]=V
[GTT]=V [GCA]=A [GCC]=A [GCG]=A [GCT]=A
[GAC]=D [GAT]=D [GAA]=E [GAG]=E [GGA]=G
[GGC]=G [GGG]=G [GGT]=G [TCA]=S [TCC]=S
[TCG]=S [TCT]=S [TTC]=F [TTT]=F [TTA]=L
[TTG]=L [TAC]=Y [TAT]=Y [TAA]=_ [TAG]=_
[TGC]=C [TGT]=C [TGA]=_ [TGG]=W
)
input='ATAATAATAATG'
i=("AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATGATT"
"CAACACCAGCATCCACCCCCGCCTCGACGCCGGCGTCTACTCCTGCTT"
"GAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTT"
"TAATACTAGTATTCATCCTCGTCTTGATGCTGGTGTTTATTCTTGTTT"
)
for string in "$input" "${i[@]}"
do
echo "$string"
fmt=$(printf " %%-%ds %%3s %%s\\\\n" ${#string})
#echo "$fmt"
output=""
while [ ${#string} -ge 3 ]
do
codon=${string:0:3}
output="$output${CODONTABLE[$codon]}"
printf "$fmt" "$string" "$codon" "$output"
string=${string#?}
done
done
The key parts are the associative array and the two expressions:
codon=${string:0:3} # Extract 3 characters from offset 0 of string
string=${string#?} # Drop the first character from string
The first part of the output is:
ATAATAATAATG
ATAATAATAATG ATA I
TAATAATAATG TAA I_
AATAATAATG AAT I_N
ATAATAATG ATA I_NI
TAATAATG TAA I_NI_
AATAATG AAT I_NI_N
ATAATG ATA I_NI_NI
TAATG TAA I_NI_NI_
AATG AAT I_NI_NI_N
ATG ATG I_NI_NI_NM
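For comparison, the same one-character sliding window can be sketched in Python. The table is cut down to the four codons that occur in the sample input, and the names translate_frames and CODON_TABLE, as well as the '#' placeholder for unknown codons, are assumptions for illustration.

```python
# Subset of the codon table, sufficient for the sample input only.
CODON_TABLE = {"ATA": "I", "TAA": "_", "AAT": "N", "ATG": "M"}


def translate_frames(dna, table=CODON_TABLE, unknown="#"):
    """Translate every overlapping 3-letter window, advancing one character at a time."""
    return "".join(
        table.get(dna[i:i + 3], unknown)   # window starting at offset i
        for i in range(len(dna) - 2)       # last full window starts at len - 3
    )


print(translate_frames("ATAATAATAATG"))
```

For the sample input this yields the same string as the last line of the Bash output, I_NI_NI_NM.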

how to compare values in the same field?

I have this input
name num value
A 1010232 1
A 1010232 2
A 1010232 3
B 2565214 1
B 2565214 2
B 2565214 3
C 6111111 2
C 6111111 3
.
.
I need output like this:
the name C has no "1" value actually
I don't have any idea about the way to solve it
$ cat file
name num value
A 1010232 1
A 1010232 2
A 1010232 3
B 2565214 1
B 2565214 2
B 2565214 3
C 6111111 2
C 6111111 3
$ awk '
NR>1 { seen[$1,$3]++; names[$1]; vals[$3] }
END {
for (name in names)
for (val in vals)
if (!seen[name,val])
printf "the name %s has no \"%s\" value actually\n", name, val
}
' file
the name C has no "1" value actually
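The idea behind the script (record every name/value pair that occurs, then report each combination that never occurs) can be sketched in Python like this. The function name missing_pairs and the in-memory rows are assumptions for illustration.

```python
def missing_pairs(rows):
    """Return every (name, value) combination that never occurs in rows."""
    seen = {(name, value) for name, value in rows}
    names = sorted({name for name, _ in rows})
    values = sorted({value for _, value in rows})
    return [(n, v) for n in names for v in values if (n, v) not in seen]


rows = [("A", "1"), ("A", "2"), ("A", "3"),
        ("B", "1"), ("B", "2"), ("B", "3"),
        ("C", "2"), ("C", "3")]
for name, value in missing_pairs(rows):
    print(f'the name {name} has no "{value}" value actually')
```

Like the awk version, this treats the set of values found anywhere in the file as the expected set for every name.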
You could try:
awk -f chk.awk input.txt
where chk.awk is:
{
a[$1,$3]++
}
END {
if (!(("C","1") in a))
print "the name C has no \"1\" value"
}

Print specific parts of a cell as strings in matlab?

I have the following matrix array B :
B=[1 2 3; 10 20 30 ; 100 200 300 ; 500 600 800];
Which through a code is combined to form possible combinations between the values. The results are stored in cell G. Such that G :
G=
[1;20;100;500]
[0;30;0;800]
[3;0;0;600]
.
.
etc
I want to format the results based on which value from B is chosen :
[1 2 3] = 'a=1 a=2 a=3'
[10 20 30] = 'b=1 b=2 b=3'
[100 200 300]= 'c=1 c=2 c=3'
[500 600 800]= 'd=1 d=2 d=3'
Example, using the results in the current cell provided :
[1;20;100;500]
[0;30;0;800]
[3;0;0;600]
Should print as
a=1 & b=2 & c=1 & d=1
a=0 & b=3 & c=0 & d=3 % notice that 0 should be printed although not present in B
a=3 & b=0 & c=0 & d=2
Note that the cell G will vary depending on the code and is not fixed. The code used to generate the results can be viewed here : Need help debugging this code in Matlab
Please let me know if you require more info about this.
You can try this:
k = 1; % which row of G
string = sprintf('a = %d, b = %d, c = %d, d = %d',...
max([0 find(B(1,:) == G{k}(1))]), ...
max([0 find(B(2,:) == G{k}(2))]), ...
max([0 find(B(3,:) == G{k}(3))]), ...
max([0 find(B(4,:) == G{k}(4))]) ...
);
For instance, for k = 1 of your example data this results in
string =
a = 1, b = 2, c = 1, d = 1
A short explanation of this code (as requested in the comments) is as follows. For the sake of simplicity, the example is limited to the first value of G and the first line of B.
% searches whether the first element in G can be found in the first row of B
% if yes, the index is returned
idx = find(B(1,:) == G{k}(1));
% if the element was not found, the function find returns an empty element. To print
% a 0 in these cases, we perform max() as a "bogus" operation on the results of
% find() and 0. If idx is empty, it returns 0, if idx is not empty, it returns
% the results of find().
val = max([0 idx])
% this value val is now formatted to a string using sprintf
string = sprintf('a = %d', val);
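The lookup can also be written as a loop over all four rows at once. The Python sketch below is only an illustration of that idea; the function name describe and the labels argument are assumptions, and positions are reported 1-based to match the MATLAB output.

```python
B = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [500, 600, 800]]


def describe(column, table=B, labels="abcd"):
    """Report, per row of the table, the 1-based position of the chosen value (0 if absent)."""
    parts = []
    for label, row, value in zip(labels, table, column):
        position = row.index(value) + 1 if value in row else 0
        parts.append(f"{label}={position}")
    return " & ".join(parts)


for column in ([1, 20, 100, 500], [0, 30, 0, 800], [3, 0, 0, 600]):
    print(describe(column))
```

One difference to keep in mind: if a row of B contained the same value twice, max over find would give the last position, whereas list.index gives the first. The rows of B here have distinct values, so both agree.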
