awk: print range of fields if other field matches value - file

I have a file with a very old format. Here's a couple of lines of examples:
000000582103145338520001 2000111420040924NR19 2RG195006 0119MR<PATRICK JOSEPH ROBERT<SNOWBALL<<<<THE OLD RECTORY<LONGHAM<EAST DEREHAM<NORFOLK<<INSURANCE COMPANY OFFICIAL<BRITISH<<
000000582103015819370001 1994010119981130CR2 8SZ 194205 0096MR<PETER GEOFFREY<WARD<<<<14 SUFFIELD CLOSE<SELSDON<SOUTH CROYDON<<<EXECUTIVE DIRECTOR<ENGLISH<<
000000582203047002770001 1992012619931231N1 8HP 193401 0099<JOHN HOWARD<WEBB<<<<1 SUDELEY STREET<ISLINGTON<LONDON<<<GROUP ACTUARY - COMMERCIAL UNION<BRITISH<<
000000582103000497250003 1998070119981130TN13 1SS195207 0126MR<RICHARD ANDREW<WHITAKER<LLB DMS FCII<<<STRATHBLANE ASHGROVE ROAD<<SEVENOAKS<KENT<<COMPANY SECRETARY<BRITISH<UNITED KINGDOM<
000000781D 00000020WALKER & ETH PORKER<
000000831D 00000014REID AND SONS<
000000841D 00000019A. WEST & PARTNERS<
000000861 00130029KENTSTONE PROPERTIES LIMITED<
I am trying to print everything from the 41st character to the end of the line, if and only if the 9th character is a 1. I know that the maximum number of characters after character 41 is 161.
Here's my awk attempt, which breaks (I mostly pieced it together from code found online; I'm not an awk expert).
awk -v b=41 -v e=201
'$9 == "1"
BEGIN{FS=OFS=""} {for (i=b;i<=e;i++)
printf "%s%s", $i, (i<e ? OFS : ORS)}'
<(head -n1000 myfile.dat)
What I expect the code to output:
WALKER & ETH PORKER<
REID AND SONS<
A. WEST & PARTNERS<
KENTSTONE PROPERTIES LIMITED<

Could you please try following.
awk 'substr($0,9,1) == 1{print substr($0,41)}' Input_file
Explanation:
awk ' ##Start of the awk program.
substr($0,9,1) == 1{ ##Take the 1-character substring at position 9 and check whether it equals 1. If the condition is TRUE, run the following block.
print substr($0,41) ##Print the substring from the 41st character to the end of the line (no length is given, so substr returns the rest of the line).
} ##Close the block for this condition.
' Input_file ##Name of the input file.
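Since the question also mentions an upper bound of 161 characters, the same idea can be sketched with an explicit length and a string comparison. This is only an illustration: the function name extract_tail and the variable names start and maxlen are assumptions, not part of the answer above.

```shell
# extract_tail is a hypothetical wrapper name; start and maxlen are assumed names.
extract_tail() {
  awk -v start=41 -v maxlen=161 '
    # Compare as a string so the flag is matched literally as the character 1.
    substr($0, 9, 1) == "1" { print substr($0, start, maxlen) }
  ' "$@"
}
```

It can be called as extract_tail myfile.dat, or at the end of a pipeline such as head -n1000 myfile.dat | extract_tail.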

A small variation on Ravinder's answer (GNU awk, where an empty FS makes each character its own field):
awk -v FS= '$9==1 {print substr($0,41)}' file
WALKER & ETH PORKER<
REID AND SONS<
A. WEST & PARTNERS<
KENTSTONE PROPERTIES LIMITED<
For help with substr, see:
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

Related

awk calculate euclidean distance results in wrong output

I have this small geo location dataset.
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
From a random location.
For example this one 37.97570, 23.66721
I need to write a bash command with awk that returns the simple Euclidean distance from that location to each point.
This is the command I use:
awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
When I run this command I get this weird result, which is not correct. Could someone explain what I am doing wrong?
➜ awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
44,746962127881936 37.9440840 23.7001760
44,746962127881936 37.9901450 23.7298770
44,746962127881936 37.9636140 23.7261360
44,746962127881936
44,746962127881936 37.9637190 23.7258230
Updated.
Appended the command that @jas provided; I included od -c as @mark-fuso suggested.
The issue now is that I get different results from @jas.
Command output which showcases the new issue:
awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
1,1820150904705098 37.9636140 23.7261360
1,1820150904705098 37.9440840 23.7001760
1,1820150904705098 37.9637190 23.7258230
1,1820150904705098 37.9901450 23.7298770
od -c that shows the content of the input file.
od -c file
0000000 3 7 . 9 6 3 6 1 4 0 , 2 3 . 7 2
0000020 6 1 3 6 0 \n 3 7 . 9 4 4 0 8 4 0
0000040 , 2 3 . 7 0 0 1 7 6 0 \n 3 7 . 9
0000060 6 3 7 1 9 0 , 2 3 . 7 2 5 8 2 3
0000100 0 \n 3 7 . 9 9 0 1 4 5 0 , 2 3 .
0000120 7 2 9 8 7 7 0 \n
0000130
While @jas has provided a 'fix' for the problem, I thought I'd throw in a few comments about what OP's code is doing ...
Some basics ...
the awk program ({for (i=1;i<=NR;i++) ... ; b[i]=$2}) is applied against each row of the input file
as each row is read from the input file the awk variable NR keeps track of the row number (ie, NR=1 for the first row, NR=2 for the second row, etc)
on the last pass through the for loop the counter (i in this case) will have a value of NR+1 (ie, the i++ is applied on the last pass through the loop thus leaving i=NR+1)
unless there are conditional checks for each line of input the awk program will apply against every line from the input file (including blank lines - more on this below)
for (i in distances)... isn't guaranteed to process the array indices in numerical order
The awk/for loop is doing the following:
for the 1st input row (NR=1) we get for (i=1;i<=1;i++) ...
for the 2nd input row (NR=2) we get for (i=1;i<=2;i++) ...
for the 3rd input row (NR=3) we get for (i=1;i<=3;i++) ...
for the 4th input row (NR=4) we get for (i=1;i<=4;i++) ...
For each row processed by awk the program will overwrite all previous entries in the distances[] array; net result is the last row (NR=4) will place the same values in all 4 entries of the distances[] array.
The a[i]=$1; b[i]=$2 array assignments occur outside the scope of the for loop so these will be assigned once per input row (ie, will not be overwritten); however, the array assignments are being made with i=NR+1; net result is the contents of the 1st row (NR=1) are stored in array entries a[2] and b[2], the contents of the 2nd row (NR=2) are stored in array entries a[3] and b[3], etc.
Modifying OP's code with print i, distances[i], a[i], b[i]} and running against the 4-line input file I get:
1 0.064310270672728084 # no data for 2nd/3rd columns because a[1] and b[1] are never set
2 0.064310270672728084 37.9636140 23.7261360 # 2nd/3rd columns are from 1st row of input
3 0.064310270672728084 37.9440840 23.7001760 # 2nd/3rd columns are from 2nd row of input
4 0.064310270672728084 37.9637190 23.7258230 # 2nd/3rd columns are from 3rd row of input
From this we can see the first column of output is the same (ie, distances[1]=distances[2]=distances[3]=distances[4]), while the 2nd and 3rd columns are the same as the input columns except they are shifted 'down' by one row.
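A minimal sketch of what the loop was presumably meant to do: store one entry per input row, keyed by NR, and walk the indices numerically in the END block. The function name distances_by_row and the variable names x0 and y0 are assumptions made for this illustration, and printf with %.6f is used here instead of OFMT.

```shell
# distances_by_row is a hypothetical name; x0 and y0 are assumed variable names.
distances_by_row() {
  x0=$1; y0=$2; shift 2
  awk -F, -v x0="$x0" -v y0="$y0" '
    # One array slot per input row, keyed by the row number NR.
    { d[NR] = sqrt(($1 - x0)^2 + ($2 - y0)^2); a[NR] = $1; b[NR] = $2 }
    # Walk the indices numerically so the output follows the input order.
    END { for (i = 1; i <= NR; i++) printf "%.6f %s %s\n", d[i], a[i], b[i] }
  ' "$@"
}
```

Called as distances_by_row 37.97570 23.66721 geo.dat, each output row then carries its own distance next to its own coordinates.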
That leaves us with two outstanding issues ...
why does OP show 5 lines of output?
why does the first column consist of the garbage 44,746962127881936?
I was able to reproduce this issue by adding a blank line on the end of my input file:
$ cat geo.dat
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
<<=== blank line !!
Which generates the following with OP's awk code:
44.746962127881936
44.746962127881936 37.9636140 23.7261360
44.746962127881936 37.9440840 23.7001760
44.746962127881936 37.9637190 23.7258230
44.746962127881936 37.9901450 23.7298770
NOTES:
this order is different from OP's sample output and is likely due to OP's awk version not processing for (i in distances)... in numerical order; OP can try something like for (i=1;i<=NR;i++)... or for (i=1;i in distances; i++)... (though the latter will not work correctly for a sparsely populated array)
OP's output (in the question; in comment to @jas' answer) shows a comma (,) in place of the period (.) for the first column so I'm guessing OP's env is using a locale that switches the comma/period as thousands/decimal delimiter (though the input data is based on an 'opposite' locale)
Notice we finally get to see the data from the 4th line of input (shifted 'down' and displayed on line 5) but the first column has what appears to be a nonsensical value ... which can be tracked back to applying the following against a blank line:
sqrt(($1 - long)^2 + ($2 - lat)^2 )
sqrt(( - long)^2 + ( - lat)^2 ) # empty line => $1 = $2 = undefined/empty
sqrt(( - 37.97570)^2 + ( - 23.66721)^2 )
sqrt( 1442.153790 + 560.136829 )
sqrt( 2002.290619 )
44.746962... # contents of 1st column
To 'fix' this issue the OP can either a) remove the blank line from the input file or b) add some logic to the awk script to only perform calculations if the input line has (numeric) values in fields #1 & #2 (ie, $1 and $2 are not empty); it's up to the coder to decide on how much validation to apply (eg, are the fields numeric, are the fields within the bounds of legitimate long/lat values, etc).
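Option b) could be sketched as follows. The numeric pattern is a deliberately simple assumption (plain decimal numbers, optional minus sign, no exponents), and the function name valid_distances and the variables x0, y0 and num are illustrative only.

```shell
# valid_distances is a hypothetical name; the pattern in num is a simplifying assumption.
valid_distances() {
  x0=$1; y0=$2; shift 2
  awk -F, -v x0="$x0" -v y0="$y0" '
    BEGIN { num = "^-?[0-9]+([.][0-9]+)?$" }   # plain decimal numbers only
    # Skip blank lines and rows whose first two fields are not numeric.
    NF < 2 || $1 !~ num || $2 !~ num { next }
    { printf "%.6f %s %s\n", sqrt(($1 - x0)^2 + ($2 - y0)^2), $1, $2 }
  ' "$@"
}
```

With this guard a trailing blank line produces no output row at all, so the 44.74... value never appears.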
One last design-related comment ... as demonstrated in jas' answer there is no need for any of the arrays (which in turn reduces memory usage) when all desired output can be generated 'on-the-fly' while processing each line of the input file.
Awk takes care of the looping for you. The code will be run in turn for each line of the input file:
$ awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
0.060152679674309095 37.9636140 23.7261360
0.045676346307474212 37.9440840 23.7001760
0.059824979147508742 37.9637190 23.7258230
0.064310270672728084 37.9901450 23.7298770
EDIT:
OP is getting different results. I notice in OP's output that there are commas instead of decimal points when printing the distance. This points to a possible issue with the locale setting.
OP confirms that the locale was set for Greek, causing the difference in output.
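One way to sidestep the locale, sketched under the assumption that the awk in use honours the locale's decimal separator, is to run that single awk process in the C locale so that the period is used for both parsing and printing. The function name euclid_c and the variables x0 and y0 are made up for this example.

```shell
# euclid_c is a hypothetical name; LC_ALL=C applies only to this one awk process.
euclid_c() {
  x0=$1; y0=$2; shift 2
  LC_ALL=C awk -F, -v x0="$x0" -v y0="$y0" '
    { printf "%.6f %s %s\n", sqrt(($1 - x0)^2 + ($2 - y0)^2), $1, $2 }
  ' "$@"
}
```

The rest of the shell session keeps its Greek locale; only the numbers read and written by this awk call use the period.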

awk split string on commas ignore if inside double quotes

I know it may sound like there are 2000 answers to this question online, but I found none for this specific case (e.g. -vFPAT in this and other answers), because I need to do it with split. I have to split a CSV file with awk in which some values may be inside double quotes. I need to tell the split function to ignore a , if it is inside "" in order to get an array of the elements.
Here is what I tried, based on other answers:
cat try.txt
Hi,I,"am,your",father
maybe,you,knew,it
but,"I,wanted",to,"be,sure"
cat tst.awk
BEGIN {}
{
    n_a = split($0,a,/([^,]*)|("[^"]+")/);
    for (i=1; i<=n_a; i++) {
        collecter[NR][i]=a[i];
    }
}
END {
    for (i=1; i<=length(collecter); i++)
    {
        for (z=1; z<=length(collecter[i]);z++)
        {
            printf "%s\n", collecter[i][z];
        }
    }
}
but no luck:
awk -f tst.awk try.txt
,
,
,
,
,
,
,
,
,
I tried other regular expressions based on other similar answers, but none works for this particular case.
Please note: double-quoted fields may or may not be present, there may be more than one, and they have no fixed position or length!
Thanks in advance for any help!
GNU awk has a function called patsplit that lets you do a split using an FPAT-style pattern (one that describes the fields rather than the separators):
$ awk '{ print "RECORD " NR ":"; n=patsplit($0, a, "([^,]*)|(\"[^\"]+\")"); for (i=1;i<=n;++i) {print i, "|" a[i] "|"}}' file
RECORD 1:
1 |Hi|
2 |I|
3 |"am,your"|
4 |father|
RECORD 2:
1 |maybe|
2 |you|
3 |knew|
4 |it|
RECORD 3:
1 |but|
2 |"I,wanted"|
3 |to|
4 |"be,sure"|
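patsplit is specific to GNU awk. Where it is not available, a character-by-character scan is one possible substitute. The sketch below is a different technique from the answer above, not a reimplementation of patsplit: csvsplit and split_csv are hypothetical names, and escaped quotes ("" inside a quoted field) are not handled specially.

```shell
# split_csv and csvsplit are hypothetical helper names for this sketch.
split_csv() {
  awk '
    # Fill arr[1..n] with the fields of line and return n.
    # A comma only ends a field while we are outside double quotes.
    function csvsplit(line, arr,    n, i, c, inq, cur) {
      n = 0; inq = 0; cur = ""
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") { inq = !inq; cur = cur c }
        else if (c == "," && !inq) { arr[++n] = cur; cur = "" }
        else cur = cur c
      }
      arr[++n] = cur
      return n
    }
    # Print record number, field number and field, one field per row.
    { n = csvsplit($0, f); for (i = 1; i <= n; i++) print NR ":" i ":" f[i] }
  ' "$@"
}
```

Like the patsplit version, this keeps the surrounding double quotes as part of the field.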
If Python is an alternative, here is a solution:
try.txt:
Hi,I,"am,your",father
maybe,you,knew,it
but,"I,wanted",to,"be,sure"
Python snippet:
import csv
with open('try.txt') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_ALL)
    for row in reader:
        print(row)
The code snippet above will result in:
['Hi', 'I', 'am,your', 'father']
['maybe', 'you', 'knew', 'it']
['but', 'I,wanted', 'to', 'be,sure']

Filter rows of files conditional on multiple arrays values

I have a number of files (N>1000) with QTL summary data, e.g. let's assume the first file is made of six lines (in reality they are all GWAS/imputed files with >10M SNPs)
cat QTL.1.txt
Chr Rs BP beta se pvalue
11 rs11224233 134945522 0.150216 0.736939 0.962375
11 rs4616056 134945709 0.129518 0.371824 0.910326
11 rs11823417 134945710 0.103462 0.41737 0.845826
11 rs80294507 134945765 0.150336 0.735363 0.961403
11 rs61907173 134946034 0.104531 0.158224 0.884548
11 rs147621717 134946277 0.105365 0.196168 0.86476
I would like to filter each of these datasets based on the chromosome and positions of a list of genes (my list has 100 genes, but for now let's assume it has 2), therefore creating N_QTL*N_Genes files. I would like to go through each gene/position for each QTL. The chromosome, start position, end position and name of the genes are stored in four arrays, and I would like to read these arrays iteratively and save the output for each QTL file for each gene.
What I have done so far doesn't work, and I know awk is not the best way to do this:
declare -a array1
declare -a array2
declare -a array3
declare -a array4
array1=(11 11) #chromosome
array2=(134945709 134945765) #start gene position
array3=(134946034 134946277) #end gene position
array4=(A B) # gene name
for qtl in 1; do # in reality it would be for qtl in 1 1000
for ((i=0; i<${#array1[@]}; i++)); do
cat QTL.$qtl.txt | awk '$1=='array1[$i]' && $3>='array2[$i]' &&
$3<='array3[$i]' {print$0}' > Gene.${array4[$i]}_QTL.$qtl.txt;
done;
done
within awk $1 is the chromosome and $3 the position- so therefore filtering based on these.
So my expected output for QTL.1.txt for Gene A would be
cat Gene.A_QTL.1.txt
Chr Rs BP beta se pvalue
11 rs4616056 134945709 0.129518 0.371824 0.910326
11 rs11823417 134945710 0.103462 0.41737 0.845826
11 rs80294507 134945765 0.150336 0.735363 0.961403
11 rs61907173 134946034 0.104531 0.158224 0.884548
And for QTL.1.txt for Gene B would be
cat Gene.B_QTL.1.txt
Chr Rs BP beta se pvalue
11 rs80294507 134945765 0.150336 0.735363 0.961403
11 rs61907173 134946034 0.104531 0.158224 0.884548
11 rs147621717 134946277 0.105365 0.196168 0.86476
I end up with empty files, probably because the way I ask these columns to be filtered based on the values of the arrays doesn't work.
Any help very much appreciated!
Thank you in advance
Mixing bash and awk for parsing files is not always the best way forward.
Here is a solution with awk only.
Assume the information you assigned to your bash arrays is stored in a file:
$ cat info
11 134945765 154945765 Gene1
12 134945522 174945522 Gene2
You could use the following awk script to perform a lookup with the data file:
awk 'NR==FNR{
    for(i=2;i<=NF;i++)
        a[$1,i]=$i
    next
}
a[$1,2]<=$3 && a[$1,3]>=$3{
    print $0 > a[$1,4]"_QTL"
}' info QTL.1.txt
This will create a file with the following content:
$ cat Gene1_QTL
11 rs80294507 134945765 0.150336 0.735363 0.961403
11 rs61907173 134946034 0.104531 0.158224 0.884548
11 rs147621717 134946277 0.105365 0.196168 0.86476
Maybe not exactly what you're looking for, but I hope this is helpful...
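If the bash arrays are to be kept, the values can be handed to awk with -v. As far as I can tell, in the loop from the question the unquoted array1[$i] reaches awk as the literal text array1[0], which awk treats as an empty array element, so the condition is never true and the files stay empty. The sketch below assumes whitespace-separated input with a header line; filter_range and its argument order are hypothetical.

```shell
# filter_range CHR START END [FILE...]: hypothetical helper name and argument order.
filter_range() {
  chr=$1; start=$2; end=$3; shift 3
  awk -v chr="$chr" -v start="$start" -v end="$end" '
    FNR == 1 { print; next }                    # keep the header line
    $1 == chr && $3 >= start && $3 <= end       # default action prints the row
  ' "$@"
}
```

Inside the loop from the question it would be called as filter_range "${array1[$i]}" "${array2[$i]}" "${array3[$i]}" "QTL.$qtl.txt" > "Gene.${array4[$i]}_QTL.$qtl.txt".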
You might want to do the following if multiple genes are located on the same chromosome (using the gene name instead of chr as the key):
awk 'NR==FNR{
    chr[$4]=$1;
    start[$4]=$2;
    end[$4]=$3;
}
NR!=FNR{
    for (var in chr){
        name=var"_"FILENAME;
        if(chr[var]==$1 && start[var] <=$3 && end[var]>=$3){
            print $0 > name;
        }
    }
}' info QTL
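The expected output in the question also carries the header line and uses names of the form Gene.A_QTL.1.txt. A variation on the script above that does both might look like the sketch below. It assumes the gene list has the columns chromosome, start, end and name (as in the info file), that the QTL file is passed by its bare name from the current directory, and split_by_gene is a made-up function name.

```shell
# split_by_gene GENES_FILE QTL_FILE: hypothetical helper; writes Gene.<name>_<QTL_FILE>.
split_by_gene() {
  awk '
    # First file: remember chromosome, start and end per gene name.
    NR == FNR { chr[$4] = $1; start[$4] = $2; end[$4] = $3; next }
    # Second file: keep the header so each output file can start with it.
    FNR == 1  { header = $0; next }
    {
      for (g in chr)
        if ($1 == chr[g] && $3 + 0 >= start[g] + 0 && $3 + 0 <= end[g] + 0) {
          out = "Gene." g "_" FILENAME
          if (!(out in seen)) { print header > out; seen[out] = 1 }
          print > out
        }
    }
  ' "$1" "$2"
}
```

A file is only created for a gene once a row matches it, so genes with no hits in a QTL file leave no empty output behind.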

Merging csv file's lines with the same initial fields and sorting them by their length

I have a huge csv file with 4 fields for each line in this format (ID1, ID2, score, elem):
HELLO, WORLD, 2323, elem1
GOODBYE, BLUESKY, 3232, elem2
HELLO, WORLD, 421, elem3
GOODBYE, BLUESKY, 41134, elem4
ETC...
I would like to merge each line which has the same ID1,ID2 fields on the same line eliminating the score field, resulting in:
HELLO, WORLD, elem1, elem3.....
GOODBYE, BLUESKY, elem2, elem4.....
ETC...
where each elem come from a different line with the same ID1,ID2.
After that I would like to sort the lines on the basis of their length.
I tried coding this in Java, but it is super slow. I have read online about AWK, but I can't really find a good place to learn its syntax for CSV files.
I used this command, how can I adapt it to my needs?
awk -F',' 'NF>1{a[$1] = a[$1]","$2}END{for(i in a){print i""a[i]}}' finale.txt > finale2.txt
Your key should be composite; also, the delimiter needs to be set to accommodate the comma and spaces.
$ awk -F', *' -v OFS=', ' '{k=$1 OFS $2; a[k]=k in a?a[k] OFS $4:$4}
END{for(k in a) print k, a[k]}' file
GOODBYE, BLUESKY, elem2, elem4
HELLO, WORLD, elem1, elem3
Explanation
Set the field separator (FS) to a comma followed by zero or more spaces, and the output field separator (OFS) to the normalized form (comma and one space). Create a composite key from the first two fields separated with OFS (since we're going to use it in the output). Append the fourth field to the array element indexed by the key (treating the first element specially, since we don't want to start with OFS). When all records are done (END block), print all keys and values.
To add the length, keep a parallel counter and increment it each time you append for a key (c[k]++), then use it when printing. That is,
$ awk -F', *' -v OFS=', ' '{k=$1 OFS $2; c[k]++; a[k]=k in a?a[k] OFS $4:$4}
END{for(k in a) print k, c[k], a[k]}' file |
sort -t, -k3n
GOODBYE, BLUESKY, 2, elem2, elem4
HELLO, WORLD, 2, elem1, elem3
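If "length" is meant as the character length of each merged line rather than the number of elements, a common idiom is to prefix every line with its length, sort numerically on that prefix, and cut it off again. This is a sketch of that alternative reading; sort_by_length is a hypothetical name.

```shell
# sort_by_length is a hypothetical name for this sketch.
sort_by_length() {
  awk '{ print length($0) "\t" $0 }' "$@" |   # prefix each line with its length
    sort -n |                                  # numeric sort on that prefix
    cut -f2-                                   # drop the prefix again
}
```

The output of the first awk command above can be piped straight into it; using sort -rn instead would put the longest lines first.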

Bash formatting text file into columns

I have a text file with data in it which is set up like a table, but separated with commas, eg:
Name, Age, FavColor, Address
Bob, 18, blue, 1 Smith Street
Julie, 17, yellow, 4 John Street
Firstly I have tried using a for loop, and placing each 'column' with all its values into a separate array.
e.g. 'nameArray' would contain bob, julie.
Here is the code from my actual script; there are 12 columns, which is why c should not be greater than 12.
declare -A Array
for((c = 1; c <= 12; c++))
{
for((i = 1; i <= $total_lines; i++))
{
record=$(cat $FILE | awk -F "," 'NR=='$i'{print $'$c';exit}'| tr -d ,)
Array[$c,$i]=$record
}
}
From here I then use the 'printf' function to format each array and print them as columns. The issue with this is that I have more than 3 arrays, in my actual code they're all in the same 'printf' line. Which I don't like and I know it is a silly way to do it.
for ((i = 1; i <= $total_lines; i++))
{
printf "%0s %-10s %-10s...etc \n" "${Array[1,$i]}" "${Array[2,$i]}" "${Array[3,$i]}" ...etc
}
This does, however, give me the desired output.
I would like to figure out how to do this another way that doesn't require a massive print statement. Also the first time I call the for loop I get an error with 'awk'.
Any advice would be appreciated, I have looked through multiple threads and posts to try and find a suitable solution but haven't found something that would be useful.
Try the column command like
column -t -s','
This is what I could come up with quickly. See the man page for details.
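Where column is not installed, or more control over the padding is wanted, a similar table can be produced in a single awk call that first records the widest cell of every column. This is an alternative sketch, not what column does internally; format_table is a made-up name, and the two-space gap between columns is an arbitrary choice.

```shell
# format_table is a hypothetical name; fields are split on a comma plus optional spaces.
format_table() {
  awk -F', *' '
    # Remember every cell and track the widest cell per column.
    {
      for (j = 1; j <= NF; j++) {
        cell[NR, j] = $j
        if (length($j) > w[j]) w[j] = length($j)
      }
      if (NF > nf) nf = NF
    }
    # Print each row left-aligned, padded to the column widths.
    END {
      for (i = 1; i <= NR; i++) {
        line = ""
        for (j = 1; j <= nf; j++)
          line = line sprintf("%-" w[j] "s", cell[i, j]) (j < nf ? "  " : "")
        sub(/ +$/, "", line)   # trim padding after the last column
        print line
      }
    }
  ' "$@"
}
```

Because the widths are computed from the data, the number of columns never has to be spelled out in a long printf format.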
