awk geometric average on the same row value - arrays

I have the input below and I would like to compute a geometric average whenever the “Cpd_number” and ”ID3” are the same. The files have a lot of data, so we might need arrays to do the trick. However, as an awk beginner, I am not sure how to start. Could anyone kindly offer some hints?
input:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”
The three lines of “95”,“123”,”4”,”5” should produce one geometric average
The two lines of “95”, “123”,”4”,”6” should produce one geometric average
The two lines of “95”, “456”,”4”,”6” should produce one geometric average
Here is the desired output:
“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”
Some info about the geometric mean (the nth root of the product of n values, or equivalently exp of the mean of the logs):
http://en.wikipedia.org/wiki/Geometric_mean
This script computes the geometric mean of the numbers in the first column:
#!/usr/bin/awk -f
{
b = $1;          # value of 1st column
C += log(b);     # running sum of logs
D++;             # count of values
}
END {
print "Geometric mean : ",exp(C/D);   # exp of the mean log = geometric mean
}
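For example, saving the script above as gmean.awk (a hypothetical file name) and feeding it a small column of numbers gives the expected cube root of 1000:
$ printf '10\n100\n1\n' | awk -f gmean.awk
Geometric mean :  10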

Having this file:
$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"
This piece:
awk -F\" 'BEGIN{print} # Print headers
last != $4""$8 && last{ # ONLY When last key "Cpd_number + ID3"
print line,exp(C/D) # differs from actual , print line + average
C=D=0} # reset acumulators
{ # This block process each line of infile
C += log($(NF-1)+0) # C calc
D++ # D counter
$(NF-1)="" # Get rid of activity col ir order to print line
line=$0 # Line will be actual line without activity
last=$4""$8} # Store the key in orther to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be trigger during
# the previous block
print line,exp(C/D)}' infile
Will produce:
ID1 , Cpd_number , ID2 , ID3 , 0
95 , 123 , 4 , 5 , 10
95 , 123 , 4 , 6 , 31.6228
95 , 456 , 4 , 6 , 31.6228
Still some work left for formatting.
NOTE: char " is used instead of “ and ”
EDIT: NF is the number of fields in the line, so NF-1 is the next-to-last field:
$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile
10
100
1
10
100
10
100
So in log($(NF-1)+0) we apply the log function to that value (adding 0 ensures it is treated as a number).
D++ is just a counter.
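Since the question mentions arrays: the script above relies on rows with the same Cpd_number/ID3 being adjacent in the file. If they are not, a variation that accumulates the log sums and counts in associative arrays keyed on the pair works too. This is only a sketch (not part of the answer above), using the same -F\" field positions; note the group order of the output is not guaranteed:
awk -F\" '
NR == 1 { print; next }              # pass the header row through untouched
{
key = $4 SUBSEP $8                   # Cpd_number + ID3
logsum[key] += log($(NF-1) + 0)      # accumulate log(activity)
count[key]++
$(NF-1) = ""                         # drop the activity field before saving the line
line[key] = $0
}
END {
for (k in line)                      # groups come out in arbitrary order
print line[k], exp(logsum[k] / count[k])
}' infile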

Why use awk? Just do it in bash, with either bc or calc to handle the floating point math. You can download calc at http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). There are rpms, binary and source tarballs available. It is far superior to bc in my opinion. The routine is fairly simple. You need to remove the extraneous " quotes from your datafile first, leaving a plain csv file. That helps. See the sed command in the comments below. Note: the script below computes the geometric mean of the activity values for each Cpd_number/ID3 group (the nth root of their product). If you need a different mean, just adjust the code below:
#!/bin/bash
##
## You must strip all quotes from data before processing, or write more code to do
## it here. Just do "$ sed 's/\"//g' < datafile > newdatafile" Then use
## newdatafile as command line argument to this program
##
## Additionally, this script uses 'calc' for floating point math. go download it
## from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
## use bc if you like, but why, calc is so much better.
##
## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }
## function to strip extraneous whitespace from input
trimWS() {
[[ -z $1 ]] && return 1
strln="${#1}"
[[ strln -lt 2 ]] && return 1
trimSTR=$1
trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}" # remove leading whitespace characters
trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}" # remove trailing whitespace characters
echo $trimSTR
return 0
}
let cnt=0
let oldsum=0 # holds value to compare against new Cpd_number & ID3
product=1 # initialize product to 1
pcnt=0 # initialize the number of values in product
IFS=$',\n' # Internal Field Separator, set to break on ',' or newline
while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do
cpd=`trimWS $cpd` # trimWS from cpd (only one that needed it)
# if first iteration, just output first row
test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"
# after first iteration, test oldsum -ne sum, if so do geometric mean
# and reset product and counters
if test "$cnt" -gt 0 ; then
sum=$((newcpd+newid3)) # calculate sum to test against oldsum
if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
# geometric mean (nth root of product)
# mean=`calc -p "root ($product, $pcnt)"` # using calc
mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
echo " $id1 $cpd $id2 $id3 average: $mean"
pcnt=0
product=1
fi
# update last values to new values
oldsum=$sum
id1="$newid1"
cpd="$newcpd"
id2="$newid2"
id3="$newid3"
act="$newact"
((product*=act)) # accumulate product
((pcnt+=1))
fi
((cnt+=1))
done < "$1"
output:
# output using calc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 10
95 123 4 6 average: 31.62277660168379331999
95 456 4 6 average: 31.62277660168379331999
# output using bc
ID1 Cpd_number ID2 ID3 activity
95 123 4 5 average: 9.999999
95 123 4 6 average: 31.622756
95 456 4 6 average: 31.622756
The updated script calculates the proper mean. It is a bit more involved due to having to keep old/new values to test for the change in cpd & id3. This may be where awk is the simpler way to go. But if you need more flexibility later, bash may be the answer.

Related

How to process all or selected rows in a csv file where column headers and order are dynamic?

I'd like to either process one row of a csv file or the whole file.
The variables are set by the header row, which may be in any order.
There may be up to 12 columns, but only 3 or 4 variables are needed.
The source files might be in either format, and all I want from both is lastname and country. I know of many different ways and tools to do it if the columns were fixed and always in the same order. But they're not.
examplesource.csv:
firstname,lastname,country
Linus,Torvalds,Finland
Linus,van Pelt,USA
examplesource2.csv:
lastname,age,country
Torvalds,66,Finland
van Pelt,7,USA
I have cobbled together something from various Stackoverflow postings which looks a bit voodoo but seems fairly robust. I say "voodoo" because shellcheck complains that, for example, "firstname is referenced but not assigned". And yet it prints it.
#!/bin/bash
#set the field separator to newline
IFS=$'\n'
#split/transpose the first-line column titles to rows
COLUMNAMES=$(head -n1 examplesource.csv | tr ',' '\n')
#set an array and read the columns into it
columns=()
for line in $COLUMNAMES; do
columns+=("$line")
done
#reset the field separator
IFS=","
#using -p here to debug in output
declare -ap columns
#read from line 2 onwards
sed 1d examplesource.csv | while read "${columns[@]}"; do
echo "${firstname} ${lastname} is from ${country}"
done
In the case of looping through everything, it works perfectly for my needs and I can process within the "while read" loop. But to make it cleaner, I'd rather pass the current element(?) to an external function to process (not just echo).
And if I only wanted the array (current row) belonging to "Torvalds", I cannot find how to access that or even get its current index, eg: "if $wantedname && $lastname == $wantedname then call function with currentrow only otherwise loop all rows and call function".
I know there aren't multidimensional associative arrays in bash from reading
Multidimensional associative arrays in Bash and I've tried to understand arrays from
https://opensource.com/article/18/5/you-dont-know-bash-intro-bash-arrays
Is it clear what I'm trying to achieve in a bash-only manner and does the question make sense?
Many thanks.
Let's shorten your function. Don't read the source twice (first with head, then with sed); you can do that once. Also, the whole array-reading loop can be shortened to just IFS=',' COLUMNAMES=($(head -n1 source.csv)). Here's a shorter version:
#!/bin/bash
cat examplesource.csv |
{
IFS=',' read -r -a columnnames
while IFS=',' read -r "${columnnames[@]}"; do
echo "${firstname} ${lastname} is from ${country}"
done
}
If you want to parse both files at the same time, i.e. join them, nothing simpler ;). First, let's number the lines in the first file using nl -w1 -s,. Then we use join to join the files on the name of the people. Remember that join input needs to be sorted on the join fields. Then we sort the output with sort using the line number from the first file. After that we can read all the data just like that:
# join the files, using `,` as the separator
# on the 3rd field from the first file and the first field from the second file
# the output should be first the fields from the first file, then the second file
# the country (field 1.4) is duplicated in 2.3, so just omitting it.
join -t, -13 -21 -o 1.1,1.2,1.3,2.2,2.3 <(
# number the lines in the first file
<examplesource.csv nl -w1 -s, |
# there is one field more, sort using the 3rd field
sort -t, -k3
) <(
# sort the second file using the first field
<examplesource2.csv sort -t, -k1
) |
# sort the output using the numbers from the first file
sort -t, -k1 -n |
# well, remove the numbers
cut -d, -f2- |
# just a normal read follows
{
# read the headers
IFS=, read -r -a names
while IFS=, read -r "${names[@]}"; do
# finally our output!
echo "${firstname} ${lastname} is from ${country} and is so many ${age} years old!"
done
}
Tested on tutorialspoint.
GNU Awk has multidimensional arrays. It also has array sorting mechanisms, which I have not used here. Please comment if you are interested in pursuing this solution further. The following depends on consistent key names and line numbers across input files, but can handle an arbitrary number of fields and input files.
$ gawk -V |gawk NR==1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)
$ gawk -F, '
FNR == 1 {for(f=1;f<=NF;f++) Key[f]=$f}
FNR != 1 {for(f=1;f<=NF;f++) People[FNR][Key[f]]=$f}
END {
for(Person in People) {
for(attribute in People[Person])
output = output FS People[Person][attribute]
print substr(output,2)
output=""
}
}
' file*
66,Finland,Linus,Torvalds
7,USA,Linus,van Pelt
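For the record, the array-sorting mechanism mentioned above can make the attribute order deterministic. A sketch of the same END block with gawk's controlled traversal order (everything else unchanged; attributes then come out alphabetically):
END {
PROCINFO["sorted_in"] = "@ind_str_asc"   # traverse array indices in ascending string order
for(Person in People) {
for(attribute in People[Person])
output = output FS People[Person][attribute]
print substr(output,2)
output=""
}
}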
A bash solution takes a bit more work than an awk solution, but if this is an exercise over what bash provides, it provides all you need to handle determining the column holding the last name from the first line of input and then outputting the lastname from the remaining lines.
An easy approach is simply to read each line into a normal array and then loop over the elements of the first line to locate the column "lastname" appears in saving the column in a variable. You can then read each of the remaining lines the same way and output the lastname field by outputting the element at the saved column.
A short example would be:
#!/bin/bash
col=0 ## column count for lastname
cnt=0 ## line count
while IFS=',' read -a arr; do ## read each line into array
if [ "$cnt" -eq '0' ]; then ## test if line-count is zero
for ((i = 0; i < "${#arr[@]}"; i++)); do ## loop for lastname
[ "${arr[i]}" = 'lastname' ] && ## test for lastname
{ col=i; break; } ## if found, save the column index and break the loop
done
fi
[ "$cnt" -gt '0' ] && ## if not headder row
echo "line $cnt lastname: ${arr[col]}" ## output lastname variable
((cnt++)) ## increment linecount
done < "$1"
Example Use/Output
Using your two data files, the output would be:
$ bash readcsv.sh ex1.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt
$ bash readcsv.sh ex2.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt
A similar implementation using awk would be:
awk -F, -v col=1 '
NR == 1 {
for (i = 1; i <= NF; i++) {
if ($i == "lastname") next
col++
}
}
NR > 1 {
print "lastname: ", $col
}
' ex1.csv
Example Use/Output
$ awk -F, -v col=1 'NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "lastname") next; col++ } } NR > 1 {print "lastname: ", $col }' ex1.csv
lastname: Torvalds
lastname: van Pelt
(output is the same for either file)
Thank you all. I've taken a couple of bits from two answers
I used the answer from David to find the number of the row, then I used the elegantly simple solution from Kamil to loop through what I need.
The result is exactly what I wanted. Thank you all.
$ readexample.sh examplesource.csv "Torvalds"
Everyone
Linus Torvalds is from Finland
Linus van Pelt is from USA
now just Torvalds
Linus Torvalds is from Finland
And this is the code - now that you know what I want it to do, if anyone can see any dangers or improvements, please let me know as I'm always learning. Thanks.
#!/bin/bash
FILENAME="$1"
WANTED="$2"
printDetails() {
SINGLEROW="$1"
[[ ! -z "$SINGLEROW" ]] && opt=("--expression" "1p" "--expression" "${SINGLEROW}p") || opt=("--expression" "1p" "--expression" "2,199p")
sed -n "${opt[#]}" "$FILENAME" |
{
IFS=',' read -r -a columnnames
while IFS=',' read -r "${columnnames[@]}"; do
echo "${firstname} ${lastname} is from ${country}"
done
}
}
findRow() {
col=0 ## column count for lastname
cnt=0 ## line count
while IFS=',' read -a arr; do ## read each line into array
if [ "$cnt" -eq '0' ]; then ## test if line-count is zero
for ((i = 0; i < "${#arr[@]}"; i++)); do ## loop for lastname
[ "${arr[i]}" = 'lastname' ] && ## test for lastname
{
col=i
break
} ## if found, save the column index and break the loop
done
fi
[ "$cnt" -gt '0' ] && ## if not headder row
if [ "${arr[col]}" == "$1" ]; then
echo "$cnt" ## output lastname variable
fi
((cnt++)) ## increment linecount
done <"$FILENAME"
}
echo "Everyone"
printDetails
if [ ! -z "${WANTED}" ]; then
echo -e "\nnow just ${WANTED}"
row=$(findRow "${WANTED}")
printDetails "$((row + 1))"
fi
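One small improvement, since the question invites them: the hardcoded "2,199p" range in printDetails silently caps the input at 199 rows. Using sed's $ address removes that limit; a sketch of the changed line (the single quotes keep $p literal for sed):
[[ ! -z "$SINGLEROW" ]] && opt=("--expression" "1p" "--expression" "${SINGLEROW}p") || opt=("--expression" "1p" "--expression" '2,$p')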

Bash - cut a text file into columns, slice strings and store value into arrays

I would like some advice on some code.
I want to write a small script that will take an input file of this format
$cat filename.txt
111222233334444555666661112222AAAA
2222333445556612323244455445454545
2334556345643534505435345353453453
(and so on)
It will be called as: script inputfile X (where X is the number of slices you want to do)
I want the script to read the file and column-ize the slices, depending on user input, i.e. if the user gave input 1,2 for the first slice and 3,4 for the second, the output would look like this:
#Here the first slice starts on the second digit, and length = 2 digits
#Here the second slice starts on the 3rd digit and length=4 digits
111 1222
222 2333
233 3455
This is what I have so far, but I only get the outputs of the first slice arranged in a line. Any advice, please?
$ ./columncut filename.txt 2
#Initialize arrays
for ((i=1 ; i <= $2; i++)); do
echo "Enter starting digit of $i string"; read a[i]
echo "Enter length in digits of $i string"; read b[i]
done
#Skim through file, slice strings
while read line
do
for i in "${!a[#]}"; do
str[i]=${line:${a[i]}:${b[i]}}
done
for i in "${!str[#]}"; do
echo -n "$i "
done
done <$1
I am unaware if there's an easier way to do a job like this, perhaps with awk? Any help would be much appreciated.
#usage: bash slice.sh d.txt "1-2" "4-8" "10-20"
#column postions 1-2, 4-8 and 10-20 printed.
#Note that it is not length but col position.
inf=$1 # source file
shift # arg 1 (the file name) is used up; discard it.
while read -r line
do
for fmt #iterate over the arguments
do
slice=`echo $line | cut -c $fmt` # generate one slice
echo -n "$slice " # oupt with two blanks, but no newline
done
echo "" # Now give the newline
done < "$inf"
Sample run:
bash slice.sh d.txt "1-2" "4-8" "10-15"
11 22223 334444
22 23334 555661
23 45563 564353
Probably it is not very difficult to store all these generated slices in an array.
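A sketch of that idea, collecting each line's slices into a bash array instead of echoing them directly (same arguments as above):
#!/bin/bash
#usage: bash slice.sh d.txt "1-2" "4-8" "10-20"
inf=$1                # source file
shift                 # remaining arguments are the column ranges
while read -r line
do
slices=()             # fresh array for this line
for fmt               # iterate over the ranges
do
slices+=( "$(echo "$line" | cut -c "$fmt")" )   # store one slice
done
echo "${slices[@]}"   # slices are now addressable as ${slices[0]}, ${slices[1]}, ...
done < "$inf"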

bash- find average of numbers in line

I am trying to read a file line by line and find the average of the numbers in each line. I am getting the error: expr: non-numeric argument
I have narrowed the problem down to sum=`expr $sum + $i`, but I'm not sure why the code doesn't work.
while read -a rows
do
for i in "${rows[#]}"
do
sum=`expr $sum + $i`
total=`expr $total + 1`
done
average=`expr $sum / $total`
done < $fileName
The file looks like this (the numbers are separated by tabs):
1 1 1 1 1
9 3 4 5 5
6 7 8 9 7
3 6 8 9 1
3 4 2 1 4
6 4 4 7 7
With some minor corrections, your code runs well:
while read -a rows
do
total=0
sum=0
for i in "${rows[#]}"
do
sum=`expr $sum + $i`
total=`expr $total + 1`
done
average=`expr $sum / $total`
echo $average
done <filename
With the sample input file, the output produced is:
1
5
7
5
2
5
Note that the answers are what they are because expr only does integer arithmetic.
Using sed to preprocess for expr
The above code could be rewritten as:
$ while read row; do expr '(' $(sed 's/[[:space:]][[:space:]]*/ + /g' <<<"$row") ')' / $(wc -w<<<$row); done < filename
1
5
7
5
2
5
Using bash's builtin arithmetic capability
expr is archaic. In modern bash:
while read -a rows
do
total=0
sum=0
for i in "${rows[#]}"
do
((sum += $i))
((total++))
done
echo $((sum/total))
done <filename
Using awk for floating point math
Because awk does floating point math, it can provide more accurate results:
$ awk '{s=0; for (i=1;i<=NF;i++)s+=$i; print s/NF;}' filename
1
5.2
7.4
5.4
2.8
5.6
Some variations on the same trick of using the IFS variable.
#!/bin/bash
while read line; do
set -- $line
echo $(( ( $(IFS=+; echo "$*") ) / $# ))
done < rows
echo
while read -a line; do
echo $(( ( $(IFS=+; echo "${line[*]}") ) / ${#line[*]} ))
done < rows
echo
saved_ifs="$IFS"
while read -a line; do
IFS=+
echo $(( ( ${line[*]} ) / ${#line[*]} ))
IFS="$saved_ifs"
done < rows
Others have already pointed out that expr is integer-only, and recommended writing your script in awk instead of shell.
Your system may have a number of tools on it that support arbitrary-precision math, or floats. Two common calculators in shell are bc which follows standard "order of operations", and dc which uses "reverse polish notation".
Either one of these can easily be fed your data such that per-line averages can be produced. For example, using bc:
#!/bin/sh
while read line; do
set - ${line}
c=$#
string=""
for n in $*; do
string+="${string:++}$1"
shift
done
average=$(printf 'scale=4\n(%s) / %d\n' $string $c | bc)
printf "%s // avg=%s\n" "$line" "$average"
done
Of course, the only bc-specific part of this is the format for the notation and the bc itself in the third last line. The same basic thing using dc might look like this:
#!/bin/sh
while read line; do
set - ${line}
c=$#
string="0"
for n in $*; do
string+=" $1 + "
shift
done
average=$(dc -e "4k $string $c / p")
printf "%s // %s\n" "$line" "$average"
done
Note that my shell supports appending to strings with +=. If yours does not, you can adjust this as you see fit.
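If your shell lacks +=, plain reassignment achieves the same appending; for instance, the accumulation line in the dc version could be written as:
string="$string $1 + "    # same effect as string+=" $1 + "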
In both of these examples, we're printing our output to four decimal places -- with scale=4 in bc, or 4k in dc. We are processing standard input, so if you named these scripts "calc", you might run them with command lines like:
$ ./calc < inputfile.txt
The set command at the beginning of the loop turns the $line variable into positional parameters, like $1, $2, etc. We then process each positional parameter in the for loop, appending everything to a string which will later get fed to the calculator.
Also, you can fake it.
That is, while bash doesn't support floating point numbers, it DOES support multiplication and string manipulation. The following uses NO external tools, yet appears to present decimal averages of your input.
#!/bin/bash
declare -i total
while read line; do
set - ${line}
c=$#
total=0
for n in $*; do
total+="$1"
shift
done
# Move the decimal point over prior to our division...
average=$(($total * 1000 / $c))
# Re-insert the decimal point via string manipulation
average="${average:0:$((${#average} - 3))}.${average:$((${#average} - 3))}"
printf "%s // %0.3f\n" "$line" "$average"
done
The important bits here are:
* declare which tells bash to add to $total with += rather than appending it as if it were a string,
* the two average= assignments, the first of which multiplies $total by 1000, and the second of which splits the result at the thousands column, and
* printf whose format enforces three decimal places of precision in its output.
Of course, input still needs to be integers.
YMMV. I'm not saying this is how you should solve this, just that it's an option. :)
This is a pretty old post, but it came up at the top of my Google search, so I thought I'd share what I came up with:
while read line; do
# Convert each line to an array
ARR=( $line )
# Append each value in the array with a '+' and calculate the sum
# (this causes the last value to have a trailing '+', so it is added to '0')
ARR_SUM=$( echo "${ARR[@]/%/+} 0" | bc -l)
# Divide the sum by the total number of elements in the array
echo "$(( ${ARR_SUM} / ${#ARR[#]} ))"
done < "$filename"

Bash function with array won't work

I am trying to write a function in bash but it won't work. The function is as follows; it gets a file in the format:
1 2 first 3
4 5 second 6
...
I'm trying to access only the strings in the 3rd word in every line and to fill the array "arr" with them, without repeating identical strings.
When I activated the "echo" command right after the for loop, it printed only the first string in every iteration (in the above case "first").
Thank you!
function storeDevNames {
n=0
b=0
while read line; do
line=$line
tempArr=( $line )
name=${tempArr[2]}
for i in $arr ; do
#echo ${arr[i]}
if [ "${arr[i]}" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
n=$(($n+1))
fi
b=0
done < $1
}
The following line seems suspicious
for i in $arr ; do
I changed it as follows and it works for me:
#! /bin/bash
function storeDevNames {
n=0
b=0
while read line; do
# line=$line # ?!
tempArr=( $line )
name=${tempArr[2]}
for i in "${arr[#]}" ; do
if [ "$i" == "$name" ]; then
b=1
break
fi
done
if [ "$b" -eq 0 ]; then
arr[n]=$name
(( n++ ))
fi
b=0
done
}
storeDevNames < <(cat <<EOF
1 2 first 3
4 5 second 6
7 8 first 9
10 11 third 12
13 14 second 15
EOF
)
echo "${arr[#]}"
You can replace all of your read block with:
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
This will fill arr with only unique names from the 3rd word such as first, second, ... This will reduce the entire function to:
function storeDevNames {
arr=( $(awk '{print $3}' <"$1" | sort | uniq) )
}
Note: this will provide a list of all unique device names in sorted order. Removing duplicates this way also destroys the original order. If preserving the original order (apart from the removed duplicates) matters, see 4ae1e1's alternative.
You're using the wrong tool. awk is designed for this kind of job.
awk '{ if (!seen[$3]++) print $3 }' <"$1"
This one-liner prints the third column of each line, removing duplicates along the way while preserving the order of lines (only the first occurrence of each unique string is printed). sort | uniq, on the other hand, breaks the original order of lines. This one-liner is also faster than using sort | uniq (for large files, which doesn't seem to be applicable in OP's case), since this one-liner linearly scans the file once, while sort is obviously much more expensive.
As an example, for an input file with contents
1 2 first 3
4 5 second 6
7 8 third 9
10 11 second 12
13 14 fourth 15
the above awk one-liner gives you
first
second
third
fourth
To put the results in an array:
arr=( $(awk '{ if (!seen[$3]++) print $3 }' <"$1") )
Then echo ${arr[@]} will give you first second third fourth.
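One caveat with arr=( $(...) ): it relies on word splitting (and globbing) of the command output. On bash 4+, mapfile avoids that; a sketch:
mapfile -t arr < <(awk '{ if (!seen[$3]++) print $3 }' <"$1")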

Using awk to store values from column reads for multiple files

I am using cygwin on Windows 7. I have a directory with all text files and I want to loop through it and, for each file, save the data from the second column of the first three rows, i.e. cells (1,2), (2,2) and (3,2).
So, the code would be something like
x1[0]=$(awk 'FNR == 1 {print $2}' $file1)
x1[1]=$(awk 'FNR == 2 {print $2}' $file1)
x1[2]=$(awk 'FNR == 3 {print $2}' $file1)
Then I want to use each $x1 value divided by 100, plus 1, as a line number to access data from another file and store it. So that's:
let "x1[0]=x1[0]/100 + 1"
let "x1[1]=x1[1]/100 + 1"
let "x1[2]=x1[2]/100 + 1"
read1=$(awk 'FNR == '"${x1[0]}"' {print $1}' $file2)
read2=$(awk 'FNR == '"${x1[1]}"' {print $1}' $file2)
read3=$(awk 'FNR == '"${x1[2]}"' {print $1}' $file2)
Do the same thing for another file, except we don't need $x1 for this.
read4=$(awk 'FNR == 1{print $3,$4,$5,$6}' $file3)
Finally, just output all these values (read1-read4) to a file.
I need to do this in a loop for all the files in the folder, and I'm not quite sure how to go about that. The tricky part is that the filename of $file3 depends on the filename of $file1,
so if $file1 = abc123def.fna.map.txt
$file3 would be abc123def.fna
$file2 is hardcoded in it and stays the same for all the iterations.
file1 is a .txt file and a part of it looks like:
99 58900
16 59000
14 73000
file2 contains 600 lines of strings.
'Actinobacillus_pleuropneumoniae_L20'
'Actinobacillus_pleuropneumoniae_serovar_3_JL03'
'Actinobacillus_succinogenes_130Z'
'file3' is a FASTA file and the first two lines look like this
>gi|94986445|ref|NC_008011.1| Lawsonia intracellularis PHE/MN1-00, complete genome
ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
The output can just write all 4 reads to a file, or, if possible, compare read1, read2 and read3 against read4, i.e. the main name should match. In my example:
None of read1-3 matches Lawsonia intracellularis, which is part of read4, so it can just print success or failure to the file.
SAMPLE OUTPUT
Actinobacillus_pleuropneumoniae_L20
Actinobacillus_pleuropneumoniae_serovar_3_JL03
Actinobacillus_succinogenes_130Z
Lawsonia intracellularis
Failure
Sorry I was wrong about the 6 reads, just need 4 actually. Thanks for the help again.
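A minimal bash sketch of the outer loop being asked about, only to show how $file3 can be derived from $file1 with parameter expansion; the per-file awk extraction would go inside, and the index file name is an assumption:
#!/bin/bash
file2=index.txt                     # hardcoded, same for every iteration (assumed name)
for file1 in ./*.fna.map.txt; do
file3=${file1%.map.txt}             # abc123def.fna.map.txt -> abc123def.fna
# ... per-file awk extraction and comparison against $file2 goes here ...
echo "processing $file1 and $file3 against $file2"
done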
This problem can be solved with TXR: http://www.nongnu.org/txr
Okay, I have these sample files (not your inputs, unfortunately):
$ ls -l
total 16
-rwxr-xr-x 1 kaz kaz 1537 2012-03-18 20:07 bac.txr # the program
-rw-r--r-- 1 kaz kaz 153 2012-03-18 19:16 foo.fna # file3: genome info
-rw-r--r-- 1 kaz kaz 24 2012-03-18 19:51 foo.fna.map.txt # file1
-rw-r--r-- 1 kaz kaz 160 2012-03-18 19:56 index.txt # file2: names of bacteria
$ cat index.txt
'Actinobacillus_pleuropneumoniae_L20'
'Actinobacillus_pleuropneumoniae_serovar_3_JL03'
'Lawsonia_intracellularis_PHE/MN1-00'
'Actinobacillus_succinogenes_130Z'
$ cat foo.fna.map.txt # note leading spaces: typo or real?
13 000
19 100
7 200
$ cat foo.fna
gi|94986445|ref|NC_008011.1| Lawsonia intracellularis PHE/MN1-00, complete genome
ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
As you can see, I cooked the data so there will be a match on the Lawsonia.
Run it:
$ ./bac.txr foo.fna.map.txt
Lawsonia intracellularis PHE/MN1-00 ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
Code follows. This is just a prototype; obviously it has to be developed and tested using the real data. I've made some guesses, like what the Lawsonia entry would look like in the index with the code attached to it.
#!/usr/local/bin/txr -f
#;;; collect the contents of the index file
#;;; into the list called index.
#;;; single quotes around lines are removed
#(block)
# (next "index.txt")
# (collect)
'#index'
# (end)
#(end)
#;;; filter underscores to spaces in the index
#(set index #(mapcar (op regsub #/_/ " ") index))
#;;; process files on the command line
#(next :args)
#(collect)
#;;; each command line argument has to match two patterns
#;;; #file1 takes the whole thing
#;;; #file3 matches the part before .map.txt
# (all)
#file1
# (and)
#file3.map.txt
# (end)
#;;; go into file 1 and collect second column material
#;;; over three lines into lineno list.
# (next file1)
# (collect :times 3)
#junk #lineno
# (end)
#;;; filter lineno list through a function which
#;;; converts to integer, divides by 100 and adds 1.
# (set lineno #(mapcar (op + 1 (trunc (int-str #1) 100))
lineno))
#;;; map the three line numbers to names through the
#;;; index, and bind these three names to variables
# (bind (name1 name2 name3) #(mapcar index lineno))
#;;; now go into file 3, and extract the name of the
#;;; bacterium there, and the genome from the 2nd line
# (next file3)
#a|#b|#c|#d| #name, complete genome
#genome
#;;; if the name matches one of the three names
#;;; then output the name and genome, otherwise
#;;; output failed
# (cases)
# (bind name (name1 name2 name3))
# (output)
#name #genome
# (end)
# (or)
# (output)
failed
# (end)
# (end)
#(end)
