Linux Bash: how to compare VAR with characters pool? - arrays

I have a text file with the elements of the periodic table, which looks like this:
1 H *Name of element in 1 language* *Second l-ge* *Third* etc
2 He *Name of element in 1 language* *Second l-ge* *Third* etc
etc
(with names of the elements in different human languages)
I need to remove every word except those in one target language. To do that, I need to know how to compare a string with an array of characters, so that a word from the text file is printed only if it contains at least one character from the array.
Could anyone help with that?
More details / How I tried to solve this:
The first problem is that I don't know how to compare a variable with an array of characters.
I wrote something like this:
#!/bin/sh
#PeriodicTable1
counter=0
while [ "$counter" -le "$#" ] ; do
counter=$(( $counter+1 )) ;
if [ "${$counter}" == *[1234567890abcdefghigklmn and etc, so here all the characters from the language that is not removing right now]* ] ; then
echo "${$counter}" ;
done
The idea was to run ./PeriodicTable1 OPTION1 OPTION2 OPT3 OPT4 etc (where the options are the text from the text file),
to get the words from the text as options $1 $2 $3 etc, up to $#,
to compare them with the array of characters from the language chosen to be kept,
and afterwards (if it worked) to use ">" to redirect the output into a text file.
So if the text in the file were 1 H Hydrogen Wasserstoff Hydrogène 氢 and I needed to keep only Chinese, I would type ./PeriodicTable1.sh 1 H Hydrogen Wasserstoff Hydrogène 氢 in the terminal and expect the output 1 H 氢. Everything else would be removed, because the characters in the other words (passed as options to the command) would not match the Chinese characters and so those words would not be printed.
Then I wrote something like
#!/bin/sh
#PeriodicTable2
for WORD in (and here is the whole text) ; do
if [ "$WORD" == *[straight line of characters]* ; echo "$WORD" ;
done
and then
#!/bin/sh
#PeriodicTable3
counter=1 ;
save={1,2,3,4,5,6,7,8,9,0,a,b,c,d, etc} ;
while [ "$counter" -le "$#" ] ; case "${$counter}" in
*"$save"*)
echo ${$counter} ;
counter=$(( $counter+1 )) ;
;
esac
and then
#!/bin/sh
#PeriodicTable4
counter=1 ;
save={a,b, etc}
while [ "${0+$counter}" -le "$#" ] ; {
if [ "${0+$counter}" == *"$save"* ] echo "${0+$counter}" ;
counter=$(( $counter+1 )) ;
}
but none of these scripts works.
The second problem is that if a non-English language has to be kept, the letters of the element symbols would disappear; but I guess I could handle that once a solution to the first problem is found...
UPDATE:
I still haven't found a solution to the comparison problem, but I found that the task can already be done with the awk command
awk '{print $number_of_word1_in_line, $n2, $n3}'
by counting words with $var and printing only those whose positions equal their number in the line + X*n, where X is the number of words between the lines and n is a var in {1..total_number_of_lines},
or with the tr command
tr -c a-zA-Z '\n' < someFile.txt | sed '/^$/d'

I think you are looking for logic along the lines of what I have put together below. NOTE: I have re-formatted the input file, because it is better to have a fixed character (that would never be used in any field) as a delimiter.
As an exercise, I created my own raw table with many more languages populated (in a LibreOffice ODS file, then exported with the bars; I can provide that if you want), from which I rebuilt the modified version of your table, where I used the vertical bar (|) as the safe delimiter.
The logic also reads the first line of the input into the "cols" array. That lets the script adapt to the format of the input file (different column assignments or contents) while still selecting the output column from the language given with the --language flag.
If interested, the source of the info I extracted came from the corresponding pages identified here.
Lastly, Persian goes right-to-left, contrary to others in your table going from left-to-right. The print statement does the context-driven justification based on column position, but that info could be provided on a second header line, and parsed accordingly for output formatting.
Hope this addresses your needs!
#!/bin/sh
### QUESTION: https://stackoverflow.com/questions/74234471/linux-bash-how-to-compare-var-with-characters-pool
DBG=0
while [ $# -gt 0 ]
do
case $1 in
--language ) tLANG=$2 ; shift ; shift ;;
--debug ) DBG=1 ; shift ;;
* ) printf '\n\t Invalid option specified on the command line. Only valid: [ --language {} ] \n Bye!\n\n' ; exit 1 ;;
esac
done
if [ -z "${tLANG}" ] ; then printf '\n\t Language specification required on the command line. \n\n\t Usage: %s --language [ English | Italian | Greek | Polish | French | Persian ]\n\n' "$0" ; exit 1 ; fi
### Format of ELEMENTS_TABLE
#Z Symbol English Italian Greek Polish French Persian
ELEMENTS_TABLE="Elements_table.csv"
### Format of ELEMENTS_RAW
#1 |2 |3___________________
#____|4 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 |15 |16 |17 |18 ||20 |21 |22 |23
#Atomic Number|Symbol|Atomic Mass (amu, g/mol)|Latin|ALTERNATE|English|Greek|German|French|Italian|Spanish|Polish|Turkish|Russian|Afrikaans|Sesotho|Chinese|Hindi||Japanese|Hebrew|Arabic|Persian
### Position Mapping for Requestor: 1 2 6 10 7 12 9 23
#ELEMENTS_RAW="Elements_details.csv"
#awk -F \| '{
# printf("%s|%s|%s|%s|%s|%s|%s|%s|\n", $1, $2, $6, $10, $7, $12, $9, $23 ) ;
#}' < "${ELEMENTS_RAW}" >"${ELEMENTS_TABLE}"
### Match on language position
#echo $tLANG
awk -F \| -v lang="${tLANG}" -v debug="${DBG}" '\
BEGIN{
getline header ;
if( debug == 1 ){
print header ;
} ;
n=split( header, cols, "|") ;
if( debug == 1 ){
for( pos=1 ; pos <= 8 ; pos++ ){
print cols[pos] ;
} ;
} ;
#cols[1]="Z" ;
#cols[2]="Symbol" ;
#cols[3]="English" ;
#cols[4]="Italian" ;
#cols[5]="Greek" ;
#cols[6]="Polish" ;
#cols[7]="French" ;
#cols[8]="Persian" ;
lCOL=0 ;
for( pos=1 ; pos <= n ; pos++ ){
if( debug == 1 ){
print "cols[", pos, "] = ", cols[pos] ;
} ;
if( cols[pos] == lang ){
lCOL=pos ;
} ;
} ;
if( lCOL == 0 ){
print "Language not found in header: " lang > "/dev/stderr" ;
exit 1 ;
} ;
if( lCOL == 8 ){
printf("%-4s %-6s %15s\n", cols[1], cols[2], cols[lCOL] ) ;
}else{
printf("%-4s %-6s %-s\n", cols[1], cols[2], cols[lCOL] ) ;
} ;
}{
#printf("%-4s %-3s %-s %-10.7f\n", $1, $2, $lCOL, $3 ) ;
if( lCOL == 8 ){
printf("%-4s %-6s %15s\n", $1, $2, $lCOL ) ;
}else{
printf("%-4s %-6s %-s\n", $1, $2, $lCOL ) ;
} ;
}' <${ELEMENTS_TABLE}
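For the narrower question as asked (testing a word against a pool of characters), a minimal sketch in POSIX sh could use a case pattern with a bracket expression. The function name keep_matching and the sample pool are made up for illustration, and the pool is assumed not to contain ], - or ^, which are special inside brackets. Matching non-ASCII characters this way also depends on the locale, so the example sticks to ASCII.

```shell
#!/bin/sh
# keep_matching POOL WORD...
# Hypothetical helper: prints only the words that contain at least one
# character from POOL. POOL is assumed not to contain ']', '-' or '^'.
keep_matching() {
    pool=$1
    shift
    out=
    for word in "$@"; do
        case $word in
            # *[...]* matches any word holding one of the pool characters
            *[$pool]*) out="${out:+$out }$word" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Keeps "1" (a digit) and "Hydrogen" (contains y); drops the rest.
keep_matching '0123456789y' 1 H Hydrogen Wasserstoff
```

The same case pattern is what the attempts in the question were reaching for: [ ... == ... ] in sh compares strings literally and does not do pattern matching, while case does.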

Related

printing invalid options using arrays

I am storing details of invalid options passed to a function in two arrays: w stores the positional index and warg stores the option name.
declare -a w=()
declare -a warg=()
local k=0 vb=0 ctp="" sty=""
while (( $# > 0 )); do
k=$((k+1))
arg="$1"
case $arg in
("--vlt")
ctp="$vlt" ; r=$k ; shift 1 ;;
("--blu")
blu=$(tput setaf 12)
ctp="$blu" ; r=$k ; shift 1 ;;
("--grn")
grn=$(tput setaf 2)
ctp="$grn" ; r=$k ; shift 1 ;;
("-m"*)
sty="${1#-m}" ; r=$k ; shift 1 ;;
("--")
shift 1 ; break ;;
("-"*)
w+=($k) ; warg+=("$arg")
shift 1 ;;
(*)
break ;;
esac
done
After that, I try to loop through the invalid options but as one can see, the positional elements in iw do not map to ${warg[$iw]}, in a way that I can print the invalid option name. What can I do?
r=4
local iw
for iw in "${w[@]}"; do
if (( iw < r )); then
printf '%s\n' "Invalid option | ${warg[$iw]}"
fi
done
Numerically-indexed arrays don't have to have sequential indices.
Using an associative array that maps color names to numbers reduces the duplication in the case branches.
declare -A colors=(
[blu]=12
[grn]=2
[ylw]=3
[orn]=166
[pur]=93
[red]=1
[wht]=7
)
declare -a warg=()
for ((i = 1; i <= $#; i++)); do
arg=${!i} # value at position $i
case $arg in
--blu|--grn|--ylw|--orn|--pur|--red|--wht)
ctp=$(tput setaf "${colors[${arg#--}]}") ;;
--vlt) ctp="$vlt" ;;
-m*) sty="${arg#-m}" ;;
--) ((i++)) ; break ;; # step past the "--" so it is shifted away too
-*) warg[i]=$arg ;;
*) break ;;
esac
done
# i is now the position of the first remaining argument
shift $((i - 1))
# iterate over the _indices_ of the wrong args array
for i in "${!warg[@]}"; do
echo "wrong arg ${warg[i]} at position $i"
done
That is untested, so there may be bugs
Addressing just the mapping issue between w[] and warg[] ...
The problem seems to be that $k is being incremented for every $arg, but the index for warg[] is only incremented for an 'invalid' $arg.
If the sole purpose of the w[] array is to track the position of 'invalid' args then you can eliminate the w[] array and populate warg[] as a sparse array with the following change to the current code:
# replace:
warg+=("$arg")
# with:
warg[$k]="$arg"
With this change your for loop becomes:
for iw in "${!warg[@]}"
do
....
printf '%s\n' "Invalid option | ${warg[$iw]}"
done
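To see the sparse-array idea in isolation, here is a small self-contained sketch. The function name report_invalid and the two accepted options are only placeholders, not the original function; the point is that the array index carries the position, so no second array is needed.

```shell
#!/usr/bin/env bash
# report_invalid ARGS...
# Hypothetical demo: records each unknown option in a sparse array whose
# index is the option's position on the command line.
report_invalid() {
    local -a warg=()
    local k=0 arg iw
    for arg in "$@"; do
        k=$((k + 1))
        case $arg in
            --)          break ;;
            --blu|--grn) ;;                 # accepted options (placeholders)
            -*)          warg[k]=$arg ;;    # index = position of the bad option
            *)           break ;;
        esac
    done
    # "${!warg[@]}" expands to the indices that are actually set
    for iw in "${!warg[@]}"; do
        printf 'Invalid option | %s (position %d)\n' "${warg[iw]}" "$iw"
    done
}

report_invalid --blu -x --grn -y file
```

With these arguments the loop reports -x at position 2 and -y at position 4, skipping the unset indices in between.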

How do you reference a variable within another variable in bash?

I am trying to do something which I thought would be fairly simple but not having any joy.
If I run the following, it successfully pulls the region (reg) from an array called PORT330 and checks to see it contains the value of $i. For example, it could be checking to see if "Europe London" contains the word "London". Here is what works:
if [[ ${PORT330[reg]} == *"$i"* ]] ; then
echo 302 is in $i ;
fi
However, I actually have a list of port arrays to check, so it may be PORT330, PORT550 and so on. I want to be able to substitute the port number with a variable but then call it within a variable. Here is what I am trying to do:
This works:
for portid in ${portids[@]} ; do
for i in ${regions[@]} ; do
if [[ ${PORT330[reg]} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done
However this doesn't work:
for portid in ${portids[@]} ; do
for i in ${regions[@]} ; do
if [[ ${PORT$portid[reg]} == *"$i"* ]] ; then
echo $portid is in $i ;
fi ;
done ;
done
It throws this error:
-su: ${PORT$portid[reg]}: bad substitution
Any pointers as to where I am going wrong?
In Bash, indirect expansion is available via ${!name}: the value of name is used as the name of the variable to expand. For an array element, store the whole reference, subscript included, in the helper variable. (Writing ${!arr[reg]} instead indirects through ${arr[reg]}, which as far as I can tell only appears to work when PORT330 is a plain indexed array, because reg is then evaluated as index 0.)
for portid in "${portids[@]}" ; do
for i in "${regions[@]}" ; do
arr="PORT${portid}[reg]"
if [[ ${!arr} == *"$i"* ]] ; then
echo "$portid is in $i" ;
fi ;
done ;
done
Here is a complete example
declare -A PORT330 PORT550
PORT330[reg]=a ;
PORT550[reg]=b ;
for portid in 330 550 ; do
for i in a b ;
do arr="PORT${portid}[reg]" ;
if [[ ${!arr} == *"$i"* ]] ; then
echo "$portid is in $i" ;
fi ;
done ;
done
Produces
330 is in a
550 is in b
Another example
:~> portids=(330 350 )
:~> echo ${portids[@]}
330 350
:~> PORT350[reg]=London
:~> PORT330[reg]=Berlin
:~> for portid in ${portids[@]} ; do arr="PORT${portid}[reg]" ; echo ${!arr} ; done
Berlin
London
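An alternative, assuming Bash 4.3 or later, is a nameref (local -n / declare -n), which makes a variable an alias for another variable so normal array syntax works on it. The sketch below is only an illustration: the helper name port_region and the sample data are invented, and the PORT arrays are assumed to be associative.

```shell
#!/usr/bin/env bash
# Sample data (invented for the demo); the arrays are associative.
declare -A PORT330=([reg]="Europe London")
declare -A PORT550=([reg]="Asia Tokyo")
portids=(330 550)
regions=(London Tokyo)

# port_region ID
# Hypothetical helper: prints the "reg" entry of the array named PORT<ID>.
port_region() {
    local -n port_ref="PORT$1"   # nameref: port_ref aliases PORT330, PORT550, ...
    printf '%s\n' "${port_ref[reg]}"
}

for portid in "${portids[@]}"; do
    for region in "${regions[@]}"; do
        if [[ $(port_region "$portid") == *"$region"* ]]; then
            echo "$portid is in $region"
        fi
    done
done
```

Putting the nameref inside a function keeps each alias local to one call, so it never has to be re-pointed inside the loop.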

Bash function with array won't work

I am trying to write a function in bash but it won't work. The function is as follows, it gets a file in the format of:
1 2 first 3
4 5 second 6
...
I'm trying to access only the strings in the 3rd word in every line and to fill the array "arr" with them, without repeating identical strings.
When I activated the "echo" command right after the for loop, it printed only the first string in every iteration (in the above case "first").
Thank you!
function storeDevNames {
n=0
b=0
while read line; do
line=$line
tempArr=( $line )
name=${tempArr[2]}
for i in $arr ; do
#echo ${arr[i]}
if [ "${arr[i]}" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
n=$(($n+1))
fi
b=0
done < $1
}
The following line seems suspicious
for i in $arr ; do
I changed it as follows and it works for me:
#! /bin/bash
function storeDevNames {
n=0
b=0
while read line; do
# line=$line # ?!
tempArr=( $line )
name=${tempArr[2]}
for i in "${arr[@]}" ; do
if [ "$i" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
(( n++ ))
fi
b=0
done
}
storeDevNames < <(cat <<EOF
1 2 first 3
4 5 second 6
7 8 first 9
10 11 third 12
13 14 second 15
EOF
)
echo "${arr[@]}"
You can replace all of your read block with:
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
This will fill arr with only unique names from the 3rd word such as first, second, ... This will reduce the entire function to:
function storeDevNames {
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
}
Note: this will provide a list of all unique device names in sorted order, so removing duplicates also destroys the original order. If you need to preserve the original order except where duplicates are removed, see 4ae1e1's alternative.
You're using the wrong tool. awk is designed for this kind of job.
awk '{ if (!seen[$3]++) print $3 }' <"$1"
This one-liner prints the third column of each line, removing duplicates along the way while preserving the order of lines (only the first occurrence of each unique string is printed). sort | uniq, on the other hand, breaks the original order of lines. This one-liner is also faster than using sort | uniq (for large files, which doesn't seem to be applicable in OP's case), since this one-liner linearly scans the file once, while sort is obviously much more expensive.
As an example, for an input file with contents
1 2 first 3
4 5 second 6
7 8 third 9
10 11 second 12
13 14 fourth 15
the above awk one-liner gives you
first
second
third
fourth
To put the results in an array:
arr=( $(awk '{ if (!seen[$3]++) print $3 }' <"$1") )
Then echo ${arr[@]} will give you first second third fourth.
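If you would rather stay in Bash, the inner search loop can be replaced by an associative array that remembers which names were already stored. This is only a sketch, assuming Bash 4 or later and whitespace-separated input; store_dev_names and the temporary demo file are illustrative names, not the original function.

```shell
#!/usr/bin/env bash
# store_dev_names FILE
# Hypothetical variant: fills the global array "arr" with the third word of
# each line of FILE, keeping only the first occurrence of each name.
store_dev_names() {
    local -A seen=()
    local name
    arr=()
    while read -r _ _ name _; do
        [[ -n $name ]] || continue            # skip lines with fewer than 3 words
        [[ -n ${seen[$name]:-} ]] && continue # already stored
        seen[$name]=1
        arr+=("$name")
    done < "$1"
}

demo_file=$(mktemp)
printf '%s\n' '1 2 first 3' '4 5 second 6' '7 8 first 9' '10 11 third 12' > "$demo_file"
store_dev_names "$demo_file"
echo "${arr[@]}"
rm -f "$demo_file"
```

Like the awk one-liner, this keeps the original order of first occurrences; for large files awk will usually be faster.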

awk geometric average on the same row value

I have the below input and I would like to do geometric average if the “Cpd_number” and ”ID3” are the same. The files have a lot of data so we might need arrays to do the tricks. However, as an awk beginner, I am not very sure how to start. Could anyone kindly offer some hints?
input:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”
Three lines of “95”,“123”,”4”,”5” should do a geometric average
Two lines of “95”, “123”,”4”,”6” should do a geometric average
Two lines of “95”, “456”,”4”,”6” should do a geometric average
Here is the desired output:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”
Some info about geometric mean:
http://en.wikipedia.org/wiki/Geometric_mean
This script computes a geometric mean
#!/usr/bin/awk -f
{
b = $1; # value of 1st column
C += log(b);
D++;
}
END {
print "Geometric mean : ",exp(C/D);
}
Having this file:
$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
This piece:
awk -F\" 'BEGIN{print} # Prints an empty line ($0 is empty in BEGIN); the header row is processed like data
last != $4""$8 && last{ # ONLY when the last key "Cpd_number + ID3"
print line,exp(C/D) # differs from the current one, print line + average
C=D=0} # reset accumulators
{ # This block processes each line of infile
C += log($(NF-1)+0) # C: running sum of logs
D++ # D: counter
$(NF-1)="" # Get rid of the activity column in order to print the line
line=$0 # line is the current line without activity
last=$4""$8} # Store the key in order to track switching
END{ # This block runs after the complete file has been read
# to print the last average, which cannot be triggered during
# the previous block
print line,exp(C/D)}' infile
Will print:
ID1 , Cpd_number , ID2 , ID3 , 0
95 , 123 , 4 , 5 , 10
95 , 123 , 4 , 6 , 31.6228
95 , 456 , 4 , 6 , 31.6228
Still some work left for formatting.
NOTE: char " is used instead of “ and ”
EDIT: NF is the number of fields in the current record, so NF-1 is the next-to-last field:
$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile
10
100
1
10
100
10
100
So in log($(NF-1)+0) we apply the log function to that value (adding 0 forces a numeric value).
D++ is just a counter.
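Since the question mentions arrays, here is a sketch of the same calculation that keeps one running log-sum per key in awk arrays, so the rows of a group do not have to be adjacent. It assumes plain " quotes, no spaces around the commas, and a single header row; geomean_by_group and the demo file are illustrative names only.

```shell
#!/bin/sh
# geomean_by_group FILE
# Hypothetical helper: groups rows by Cpd_number (column 2) and ID3
# (column 4) and prints the geometric mean of activity (column 5).
geomean_by_group() {
    awk -F, '
        NR == 1 { print; next }              # pass the header row through
        {
            gsub(/"/, "")                    # strip quotes; awk re-splits $0
            key = $2 SUBSEP $4
            if (!(key in count)) {
                order[++groups] = key        # remember first-seen order
                prefix[key] = $1 "," $2 "," $3 "," $4
            }
            logsum[key] += log($5)
            count[key]++
        }
        END {
            for (i = 1; i <= groups; i++) {
                key = order[i]
                printf "%s,%.2f\n", prefix[key], exp(logsum[key] / count[key])
            }
        }
    ' "$1"
}

demo=$(mktemp)
cat > "$demo" <<'EOF'
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
EOF
geomean_by_group "$demo"
rm -f "$demo"
```

For the sample data the three groups come out as 10.00, 31.62 and 31.62; the quotes are dropped from the data rows and would need to be added back in the printf if the output must match the input format exactly.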
Why use awk, just do it in bash, with either bc or calc to handle floating point math. You can download calc at http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). There are rpms, binary and source tarballs available. It is far superior to bc in my opinion. The routine is fairly simple. You need to remove the extraneous " quotes from your datafile first, leaving a plain csv file. That helps. See the sed command used in the comments below. Note, the script takes the geometric mean of the activity values in each group, i.e. the nth root of their product. If you need a different mean, just adjust the code below:
#!/bin/bash
##
## You must strip all quotes from data before processing, or write more code to do
## it here. Just do "$ sed 's/\"//g' < datafile > newdatafile" Then use
## newdatafile as command line argument to this program
##
## Additionally, this script uses 'calc' for floating point math. go download it
## from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
## use bc if you like, but why, calc is so much better.
##
## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }
## function to strip extraneous whitespace from input
trimWS() {
[[ -z $1 ]] && return 1
strln="${#1}"
[[ strln -lt 2 ]] && return 1
trimSTR=$1
trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}" # remove leading whitespace characters
trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}" # remove trailing whitespace characters
echo $trimSTR
return 0
}
let cnt=0
let oldsum=0 # holds value to compare against new Cpd_number & ID3
product=1 # initialize product to 1
pcnt=0 # initialize the number of values in product
IFS=$',\n' # Internal Field Separator, set to break on ',' or newline
while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do
cpd=`trimWS $cpd` # trimWS from cpd (only one that needed it)
# if first iteration, just output first row
test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"
# after first iteration, test oldsum -ne sum, if so do geometric mean
# and reset product and counters
if test "$cnt" -gt 0 ; then
sum=$((newcpd+newid3)) # calculate sum to test against oldsum
if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
# geometric mean (nth root of product)
# mean=`calc -p "root ($product, $pcnt)"` # using calc
mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
echo " $id1 $cpd $id2 $id3 average: $mean"
pcnt=0
product=1
fi
# update last values to new values
oldsum=$sum
id1="$newid1"
cpd="$newcpd"
id2="$newid2"
id3="$newid3"
act="$newact"
((product*=act)) # accumulate product
((pcnt+=1))
fi
((cnt+=1))
done < "$1"
output:
# output using calc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 10
95 123 4 6 average: 31.62277660168379331999
95 456 4 6 average: 31.62277660168379331999
# output using bc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 9.999999
95 123 4 6 average: 31.622756
95 456 4 6 average: 31.622756
The updated script calculates the proper mean. It is a bit more involved due to having to keep old/new values to test for the change in cpd & id3. This may be where awk is the simpler way to go. But if you need more flexibility later, bash may be the answer.

Awk conditional filter one file based on another (or other solutions)

Programming beginner here needs some help modifying an AWK script to make it conditional. Alternative non-awk solutions are also very welcome.
NOTE Main filtering is now working thanks to help from Birei but I have an additional problem, see note below in question for details.
I have a series of input files with 3 columns like so:
chr4 190499999 190999999
chr6 61999999 62499999
chr1 145499999 145999999
I want to use these rows to filter another file (refGene.txt) and, if a row in file one matches a row in refGene.txt, to output column 13 of refGene.txt to a new file 'ListofGenes_$f'.
The tricky part for me is that I want it to count as a match as long as column one (eg 'chr4', 'chr6', 'chr1' ) and column 2 AND/OR column 3 matches the equivalent columns in the refGene.txt file. The equivalent columns between the two files are $1=$3, $2=$5, $3=$6.
Then I am not sure in awk how to not print the whole row from refGene.txt but only column 13.
NOTE I have achieved the conditional filtering described above thanks to help from Birei. Now I need to incorporate an additional filter condition. I also need to output column $13 from the refGene.txt file if any part of the region between values $2 and $3 overlaps with the region between $5 and $6 in the refGene.txt file. This seems a lot trickier, as it involves mathematical computation to see if the regions overlap.
My script so far:
FILES=/files/*txt
for f in $FILES ;
do
awk '
BEGIN {
FS = "\t";
}
FILENAME == ARGV[1] {
pair[ $1, $2, $3 ] = 1;
next;
}
{
if ( pair[ $3, $5, $6 ] == 1 ) {
print $13;
}
}
' $(basename $f) /files/refGene.txt > /files/results/$(basename $f) ;
done
Any help is really appreciated. Thanks so much!
Rubal
One way.
awk '
BEGIN { FS = "\t"; }
## Save third, fifth and sixth field of first file in arguments (refGene.txt) as the key
## to compare later. As value the field to print.
FNR == NR {
pair[ $3, $5, $6 ] = $13;
next;
}
## Set the name of the output file.
FNR == 1 {
output_file = "";
n = split( FILENAME, path, "/" );
## Rebuild the directory part of the path, then prefix the file name.
for ( i = 1; i < n; i++ ) {
output_file = output_file path[i] "/";
}
output_file = output_file "ListOfGenes_" path[n];
}
## If $1 = $3, $2 = $5 and $3 = $6, print $13 to output file.
{
if ( pair[ $1, $2, $3 ] ) {
print pair[ $1, $2, $3 ] >output_file;
}
}
' refGene.txt /files/rubal/*.txt
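For the additional overlap condition in the question's NOTE, a possible sketch is below. It assumes tab-separated files, closed (inclusive) intervals, and that refGene.txt keeps the chromosome in column 3, start in column 5, end in column 6 and the gene name in column 13, as described in the question. Two intervals overlap when each one starts no later than the other ends. The function name overlapping_genes and the demo rows are invented.

```shell
#!/bin/sh
# overlapping_genes REGIONS REFGENE
# Hypothetical helper: prints column 13 of every REFGENE row whose
# [col5, col6] interval overlaps a region on the same chromosome.
overlapping_genes() {
    awk -F'\t' '
        FNR == NR {                          # first file: remember each region
            n++; chr[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0
            next
        }
        {                                    # second file: test every region
            for (i = 1; i <= n; i++) {
                if ($3 == chr[i] && $5 + 0 <= hi[i] && lo[i] <= $6 + 0) {
                    print $13
                    break                    # print each gene row at most once
                }
            }
        }
    ' "$1" "$2"
}

regions_file=$(mktemp)
genes_file=$(mktemp)
printf 'chr4\t190499999\t190999999\n' > "$regions_file"
printf '1\tNM_A\tchr4\t+\t190400000\t190600000\t7\t8\t9\t10\t11\t12\tGENE_A\n' > "$genes_file"
printf '2\tNM_B\tchr4\t+\t100\t200\t7\t8\t9\t10\t11\t12\tGENE_B\n' >> "$genes_file"
overlapping_genes "$regions_file" "$genes_file"
rm -f "$regions_file" "$genes_file"
```

In the demo only GENE_A is printed, since its interval straddles the start of the region. The check loops over every region for every gene row, which is fine for short region lists; for many regions, sorting or indexing by chromosome would be worth considering.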
