Extract columns from every six lines of a file

I have a file that looks like this:
194170,46.9,42.2
194170,47.7,40.0
194170,48.5,42.0
194170,48.6,43.0
194170,49.8,39.2
194170,50.2,43.3
194179,44.9,36.9
194179,45.3,36.3
194179,46.4,36.9
194179,47.5,34.4
194179,48.0,40.0
194179,49.6,37.1
194184,52.8,51.1
194184,52.9,49.8
194184,54.0,51.9
194184,56.8,54.9
194184,57.6,53.6
194184,57.8,52.9
...
For a given line, the first number is an ID, and the second and third numbers are what I'm interested in. For lines with the same ID (that is, every six lines), the values in a given column are for consecutive years. I want to end up with a file that looks like this:
194170,46.9,47.7,48.5,48.6,49.8,50.2
194170,42.2,40.0,42.0,43.0,39.2,43.3
194179,44.9,45.3,46.4,47.5,48.0,49.6
194179,36.9,36.3,36.9,34.4,40.0,37.1
That is, for lines with the same ID, I want to group the consecutive numbers from the second column together, and likewise with the third column.
Is this possible to do with awk/sed/others?

Another answer with awk:
awk -F, '{a[$1] = a[$1]","$2}END{for(i in a) print i a[i]}' yourfile
For two columns:
awk -F, '{a[$1] = a[$1]","$2;b[$1] = b[$1]","$3}END{for(i in a) print i a[i]"\n"i b[i]}' yourfile
Anyway, I prefer tidyr in R for that kind of task.

With awk:
awk -F',' '{ a[$1] = a[$1] ? a[$1] FS $2 : $2 ; b[$1] = b[$1] ? b[$1] FS $3 : $3}
END { for(idx in a){ print idx,a[idx] ; print idx,b[idx]}}' yourfile
Explanation:
-F sets the field separator
a[] collects the second-column values for each ID
b[] collects the third-column values for each ID
the END{} block prints the collected values
Example:
$ awk -F',' '{ a[$1] = a[$1] ? a[$1] FS $2 : $2 ; b[$1] = b[$1] ? b[$1] FS $3 : $3}
END { for(idx in a){ print idx,a[idx] ; print idx,b[idx]}}' yourfile
194170 46.9,47.7,48.5,48.6,49.8,50.2
194170 42.2,40.0,42.0,43.0,39.2,43.3
194184 52.8,52.9,54.0,56.8,57.6,57.8
194184 51.1,49.8,51.9,54.9,53.6,52.9
194179 44.9,45.3,46.4,47.5,48.0,49.6
194179 36.9,36.3,36.9,34.4,40.0,37.1

Another awk version, which doesn't use arrays and maintains the original order. Not using arrays can be an advantage if it's a very large file that you don't want to load entirely into memory before printing; otherwise, the array version is fine, assuming you don't care about the ordering.
BEGIN { FS = OFS = "," }
!prev_id { prev_id = $1 }
$1 == prev_id { r1 = r1 OFS $2; r2 = r2 OFS $3 }
$1 != prev_id { print prev_id r1 ORS prev_id r2;
r1 = OFS $2; r2 = OFS $3; prev_id = $1 }
END { print prev_id r1 ORS prev_id r2 }
$ awk -f v3.awk file.txt
194170,46.9,47.7,48.5,48.6,49.8,50.2
194170,42.2,40.0,42.0,43.0,39.2,43.3
194179,44.9,45.3,46.4,47.5,48.0,49.6
194179,36.9,36.3,36.9,34.4,40.0,37.1
194184,52.8,52.9,54.0,56.8,57.6,57.8
194184,51.1,49.8,51.9,54.9,53.6,52.9

Related

Getting all values of various rows which have the same value in one column with awk

I have a data set (test-file.csv) with three columns:
node,contact,mail
AAAA,Peter,peter#anything.com
BBBB,Hans,hans#anything.com
CCCC,Dieter,dieter#anything.com
ABABA,Peter,peter#anything.com
CCDDA,Hans,hans#anything.com
I'd like to extend the header with a count column and rename node to nodes.
Furthermore, all entries should be sorted by the mail column.
The count column should contain the number of occurrences of each mail value,
and nodes should list all entries sharing that mail value (space-separated and alphabetically sorted).
This is what I try to achieve:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
I have this awk-command:
awk -F"," '
BEGIN{
FS=OFS=",";
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR>1{
counts[$3]++; # Increment count of lines.
contact[$2]; # contact
}
END {
# Iterate over all third-column values.
for (x in counts) {
printf "%s,%s,%s,%s\n", contact[x],x,counts[x],"nodes"
}
}
' test-file.csv | sort --field-separator="," --key=2 -n
However, this is my result :-(
Only the count of occurrences works.
,Dieter#anything.com,1,nodes
,hans#anything.com,2,nodes
,peter#anything.com,2,nodes
contact,mail,count,nodes
Any help appreciated!
You may use this GNU awk. In your attempt, contact[$2] only creates an empty entry keyed by the contact name (so contact[x] stays empty when x is a mail address), and no node list is ever built; key everything by the mail field ($3) instead:
awk '
BEGIN {
FS = OFS = ","
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR > 1 {
++counts[$3] # Increment count of lines.
name[$3] = $2
map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
# Iterate over all third-column values.
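# PROCINFO["sorted_in"] is GNU-awk-only; "#ind_str_asc" makes the for-in loop
# below visit the array indices (the mail addresses) in ascending string order.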
PROCINFO["sorted_in"]="#ind_str_asc";
for (k in counts)
print name[k], k, counts[k], map[k]
}
' test-file.csv
Output:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2
mail[nf]=$NF
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,mail[i],arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file
Explanation: a detailed walk-through of the above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="," } ##In BEGIN section setting FS, OFS as comma here.
FNR==1{ ##if this is first line then do following.
sub(/^[^,]*,/,"") ##Substituting everything till 1st comma here with NULL in current line.
$1=$1 ##Reassigning 1st field to itself.
print $0,"count,nodes" ##Printing headers as per need to terminal.
}
FNR>1{ ##If line is Greater than 1st line then do following.
nf=$2 ##Creating nf with 2nd field value here.
mail[nf]=$NF ##Creating mail with nf as index and value is last field value.
NF-- ##Decreasing value of current number of fields by 1 here.
arr[nf]++ ##Creating arr with index of nf and keep increasing its value with 1 here.
val[nf]=(val[nf]?val[nf] " ":"")$1 ##Creating val with index of nf and keep adding $1 value in it.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr in here.
print i,mail[i],arr[i],val[i] | "sort -t, -k1" ##printing values to get expected output and sorting it also by pipe here as per requirement.
}
}
' Input_file ##Mentioning Input_file name here.
2nd solution: in case you want to sort by the 2nd and 3rd fields, try the following.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2 OFS $3
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file

Counting by analyzing two columns in a difficult pattern in awk, probably using arrays

I have a huge problem. I am trying to create a script that counts a specific sum (the number of water bridges, but never mind that). This is a small part of my data file:
POP62 SOL11
KAR1 SOL24
KAR5 SOL31
POP17 SOL42
POP15 SOL2
POP17 SOL2
KAR7 SOL42
KAR1 SOL11
KAR6 SOL31
In the first column I have POP or KAR with numbers, like KAR1, POP17, etc. In the second column I always have SOL with a number, but any given SOL appears at most twice (for example, I can have at most 2 of SOL42 or SOL11, etc.; KAR and POP can appear more than twice).
And now the thing that I want to do.
If I find that the same SOL is connected with both a KAR and a POP (whatever their numbers), I add 1. For example:
KAR6 SOL5
POP8 SOL5
I add one to the sum.
In my data
POP62 SOL11
KAR1 SOL24
KAR5 SOL31
POP17 SOL42
POP15 SOL2
POP17 SOL2
KAR7 SOL42
KAR1 SOL11
KAR6 SOL31
I should have sum = 2, because
POP17 SOL42
KAR7 SOL42
and
POP62 SOL11
KAR1 SOL11
Do you have any idea how to do that? I thought about using NR==FNR and going through the file twice, checking the repetitions in $2, maybe with an array, but what next?
#!/bin/bash
awk 'NR==FNR ??
some condition {sum++}
END {print sum}' test1.txt{,} >> water_bridges_x2.txt
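For completeness, the two-pass NR==FNR idea sketched above can be filled in roughly like this (a minimal sketch only; the single-pass answers below are simpler):
awk '
NR == FNR { seen[$2]++; next }      # pass 1: count occurrences of each SOL
seen[$2] == 2 {                     # pass 2: only a SOL seen twice can touch both types
    type = $1
    sub(/[0-9]+$/, "", type)        # KAR7 -> KAR, POP17 -> POP
    hit[$2, type] = 1
}
END {
    for (sol in seen)
        if ((sol, "KAR") in hit && (sol, "POP") in hit)
            sum++
    print sum + 0
}' test1.txt test1.txt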
Edit (solution): I also add 0 at the end, because I want to print 0 instead of nothing when the sum is empty; see the accepted answer below.
You may try this awk:
awk '
{
s = $1
sub(/[0-9]+$/, "", s) # strip digits from end in var s
if ($2 in map && map[$2] != s) # if existing entry is not same
++sum # increment sum
map[$2] = s
}
END {print sum+0}' file
2
With your shown samples, here is another way of doing it. Written and tested in GNU awk; it should work in any awk.
awk '
{
match($1,/^[a-zA-Z]+/)              # find the leading letters of the 1st field
val=substr($1,RSTART,RLENGTH)       # val is the type, e.g. KAR or POP
if(($2 in arr) && arr[$2]!=val){    # this SOL was already seen with the other type
sum++
}
arr[$2]=val                         # remember the type seen for this SOL
}
END{
print sum
}
' Input_file
A similar answer to anubhava's: this uses GNU awk for its true multi-dimensional arrays:
gawk '
{sols[$2][substr($1,1,3)] = 1}
END {
for (sol in sols)
if ("POP" in sols[sol] && "KAR" in sols[sol])
sum++
print sum
}
' file
another solution
$ sed -E 's/[0-9]+ +/ /' file | # cleanup data
sort -k2 | # sort by key
uniq | # remove dups
uniq -c -f1 | # count by key
egrep '^ +2 ' -c # report the sum where count is 2.
2

How to find maximum value in a column with awk

I have a file with two sets of data divided by a blank line:
a 3
b 2
c 1

e 5
d 8
f 1
Is there a way to find the maximum value of the second column in each set and print the corresponding line with awk? The result should be:
a 3
d 8
Thank you.
Could you please try the following, written and tested in GNU awk on your shown samples.
awk '
!NF{
if(max!=""){ print arr[max],max }
max=""
}
{
max=( (max<$2) || (max=="") ? $2 : max )
arr[$2]=$1
}
END{
if(max!=""){ print arr[max],max }
}
' Input_file
Explanation: a detailed walk-through of the above.
awk ' ##Starting awk program from here.
!NF{ ##if NF is NULL then do following.
if(max!=""){ print arr[max],max } ##Checking if max is SET then print arr[max] and max.
max="" ##Nullifying max here.
}
{
max=( (max<$2) || (max=="") ? $2 : max ) ##If max is empty or less than the 2nd field, set max to the 2nd field; otherwise keep max.
arr[$2]=$1 ##Creating arr with 2nd field index and 1st field as value.
}
END{ ##Starting END block of this program from here.
if(max!=""){ print arr[max],max } ##Checking if max is SET then print arr[max] and max.
}
' Input_file ##mentioning Input_file name here.
You may use this alternate GNU awk:
awk -v RS= '{               # RS= : read each blank-line-separated block as one record
max=""
# gawk 4-argument split: the tokens matched by the "separator" regex /[^[:space:]]+/
# are stored in m, so m[1],m[3],... are the labels and m[2],m[4],... are the values
split($0, a, /[^[:space:]]+/, m)
for (i=1; i in m; i+=2)
if (!max || m[i+1] > max) {
mi = i
max = m[i+1]
}
print m[mi], m[mi+1]
}' file
a 3
d 8
Another awk:
$ awk '
!$0 {
print n
m=n=""
}
$2>m {
m=$2
n=$0
}
END {
print n
}' file
Output:
a 3
d 8
another awk
$ awk '{cmd="sort -k2nr | head -1"} !NF{close(cmd)} {print | cmd}' file
a 3
d 8
runs the command for each block to find the block max.
You could try to separate the data sets by doing:
awk -v RS= 'NR == 1 {print}' yourfile > anotherfile
This will return the first data set; change NR == 1 to NR == 2 to get the second data set, and then find the maximum in each data set as suggested here.
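Putting the two steps together (a minimal sketch of the approach just described; the file names set1 and set2 are only placeholders):
awk -v RS= 'NR == 1 {print}' yourfile > set1     # first block
awk -v RS= 'NR == 2 {print}' yourfile > set2     # second block
for f in set1 set2; do
    awk 'NR == 1 || $2+0 > max {max = $2+0; line = $0} END {print line}' "$f"
done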

Conditionally replace multiple values in multiple columns

I have a comma-delimited file where some values can be missing, like
1,f,12,f,t,18
2,t,17,t, ,17
3,t,15, ,f,16
I want to change some of the columns to numeric: f to 0 and t to 1. Here, I want to change only columns 2 and 5 and don't want to change column 4. My result file should look like
1,0,12,f,1,18
2,1,17,t, ,17
3,1,15, ,0,16
I can use the statement
awk -F, -v OFS=',' '{ if ( $2 ~ /t/ ) { $2 = 1 } else if ( $2 ~ /f/ ) { $2 = 0 }; print}' test.csv
to change individual columns.
I can also use a loop like
awk -F, -v OFS=',' 'BEGIN {
FS = OFS = ","
}
{
for (column = 1; column <= 4; ++column) {
if ($column ~ /t/) {
$column = 1
}
else if($column ~ /f/) {
$column = 0
}
}
print
}
' test.csv
to replace multiple columns if they are adjacent. How do I change the for loop to target only specific columns? I know there is a for-each loop that does the same, but I couldn't get it to work. Also, how can I assign multiple values to an array in a single statement, like
a =[1, 2, 3, 4]
You can use this awk:
awk 'BEGIN{ FS=OFS=","; a[2]; a[5] }
{ for (i in a) if ($i=="f") $i=0; else if ($i=="t") $i=1 } 1' file
1,0,12,f,1,18
2,1,17,t, ,17
3,1,15, ,0,16
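If you would rather list the target columns in one place (or pass them in from the shell), a small variation is to split a string of column numbers into the array; a sketch of that idea, with the list supplied via -v:
awk -v cols='2,5' 'BEGIN{ FS=OFS=","; n=split(cols, c, ",") }
{ for (j=1; j<=n; j++) { i=c[j]; if ($i=="f") $i=0; else if ($i=="t") $i=1 } } 1' file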

Awk conditional filter one file based on another (or other solutions)

Programming beginner here needs some help modifying an AWK script to make it conditional. Alternative non-awk solutions are also very welcome.
NOTE Main filtering is now working thanks to help from Birei but I have an additional problem, see note below in question for details.
I have a series of input files with 3 columns like so:
chr4 190499999 190999999
chr6 61999999 62499999
chr1 145499999 145999999
I want to use these rows to filter another file (refGene.txt) and, if a row in file one matches a row in refGene.txt, to output column 13 of refGene.txt to a new file 'ListofGenes_$f'.
The tricky part for me is that I want it to count as a match as long as column one (e.g. 'chr4', 'chr6', 'chr1') and column 2 AND/OR column 3 match the equivalent columns in the refGene.txt file. The equivalent columns between the two files are $1=$3, $2=$5, $3=$6.
Then I am not sure in awk how to not print the whole row from refGene.txt but only column 13.
NOTE I have achieved the conditional filtering described above thanks to help from Birei. Now I need to incorporate an additional filter condition: I also need to output column $13 from refGene.txt if any part of the region between $2 and $3 overlaps the region between $5 and $6 in refGene.txt. This seems a lot trickier, as it involves a numeric check to see whether the regions overlap.
My script so far:
FILES=/files/*txt
for f in $FILES ;
do
awk '
BEGIN {
FS = "\t";
}
FILENAME == ARGV[1] {
pair[ $1, $2, $3 ] = 1;
next;
}
{
if ( pair[ $3, $5, $6 ] == 1 ) {
print $13;
}
}
' $(basename $f) /files/refGene.txt > /files/results/$(basename $f) ;
done
Any help is really appreciated. Thanks so much!
Rubal
One way.
awk '
BEGIN { FS = "\t"; }
## Save the third, fifth and sixth fields of the first file in the arguments (refGene.txt) as the key
## to compare later; the thirteenth field is the value to print.
FNR == NR {
pair[ $3, $5, $6 ] = $13;
next;
}
## Set the name of the output file.
FNR == 1 {
output_file = "";
split( ARGV[ARGIND], path, /\// );
for ( i = 1; i < length( path ); i++ ) {
output_file = output_file path[i] "/";
}
output_file = output_file "ListOfGenes_" path[i];
}
## If $1 = $3, $2 = $5 and $3 = $6, print $13 to output file.
{
if ( pair[ $1, $2, $3 ] ) {
print pair[ $1, $2, $3 ] >output_file;
}
}
' refGene.txt /files/rubal/*.txt
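The above handles exact matches only. For the overlap condition added in the question's NOTE (a region [$2, $3] in the input files overlapping a refGene region [$5, $6] on the same chromosome), here is a minimal sketch of the usual interval test: two regions overlap when each one starts before the other ends. The name regions.txt is a placeholder for whichever input file is being processed, and refGene.txt is assumed small enough that a per-row linear scan is acceptable:
awk '
BEGIN { FS = "\t" }
# First file (refGene.txt): remember chromosome, start, end and the gene name ($13).
FNR == NR { chrom[FNR] = $3; start[FNR] = $5; end[FNR] = $6; gene[FNR] = $13; n = FNR; next }
# Second file: print the gene name for every refGene row on the same chromosome
# whose region overlaps [$2, $3]; two regions overlap when each starts before the other ends.
{
    for (j = 1; j <= n; j++)
        if ($1 == chrom[j] && $2 <= end[j] && start[j] <= $3)
            print gene[j]
}
' refGene.txt regions.txt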
