Compare two files with awk and map aggregated multiple values - loops

I have two files I need to compare, and I need to map values to an id that multiple rows match.
My mapping file (map.csv) looks like:
id,name
123,Hans
123,Britta
232,Peter
343,Siggi
343,Horst
The data file (data.csv) is
contact,id,names
m#a.de,123,
ad#23.com,343,
adf#er.org,123,
af#go.er,232,
llk#fh.com,343,
ad#wer.org,789,
The desired output should look like this
contact,id,names
m#a.de,123,Hans Britta
ad#23.com,343,Siggi Horst
adf#er.org,123,Hans Britta
af#go.er,232,Peter
llk#fh.com,343,Siggi Horst
ad#wer.org,789,NO ENTRY
There are multiple values for one ID in the mapping file, and they should be printed space-separated into the names column of the data file. If an ID is missing from the mapping file, "NO ENTRY" should be printed instead.
This is my awk command:
awk 'NR==FNR{a[$1];next}{print $0,($2 in a)? a[$2]:"NO ENTRY"}' map.csv data.csv
I am clearly failing because I do not know how to loop through the mapping file to collect multiple values for one id (or, currently, any value at all).

With your shown samples, please try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
arr[$1]=(arr[$1]?arr[$1] " ":"")$2
next
}
FNR==1{
print
next
}
{
sub(/,$/,"")
print $0,($2 in arr)?arr[$2]:"NO ENTRY"
}
' map.csv data.csv
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS and OFS as comma here.
}
FNR==NR{ ##Checking condition which will be TRUE when map.csv is being read.
arr[$1]=(arr[$1]?arr[$1] " ":"")$2 ##Creating arr with index of $1 and which has value of $2 and keep concatenating its value with same index.
next ##next will skip all further statements from here.
}
FNR==1{ ##Checking condition if this is first line of data.csv then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
{
sub(/,$/,"") ##Removing the trailing comma from the current line here.
print $0,($2 in arr)?arr[$2]:"NO ENTRY" ##Printing current line and printing either value of arr with index of $2 OR printing NO ENTRY as per requirement.
}
' map.csv data.csv ##Mentioning Input_file names here.
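As a quick check, the command above can be run against the question's samples in a scratch directory (the awk is the same as above, with the ternary parenthesized for portability across awk implementations):

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1

cat > map.csv <<'EOF'
id,name
123,Hans
123,Britta
232,Peter
343,Siggi
343,Horst
EOF

cat > data.csv <<'EOF'
contact,id,names
m#a.de,123,
ad#23.com,343,
adf#er.org,123,
af#go.er,232,
llk#fh.com,343,
ad#wer.org,789,
EOF

# Build a space-separated list per id from map.csv, then append it
# (or NO ENTRY) to each data.csv row after stripping the trailing comma.
out=$(awk '
BEGIN{ FS=OFS="," }
FNR==NR{ arr[$1]=(arr[$1]?arr[$1] " ":"")$2; next }
FNR==1{ print; next }
{ sub(/,$/,""); print $0,(($2 in arr)?arr[$2]:"NO ENTRY") }
' map.csv data.csv)

printf '%s\n' "$out"
```

The printed result matches the desired output from the question line for line.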

You can use two rules in your case. One to capture the data from map.csv and then a second rule to output the results, e.g.
(edit -- updated to match 1st row of output exactly)
awk -F, '
NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
FNR==1 { print; next }
{ printf "%s,%s,%s\n", $1, $2, a[$2]?a[$2]:"NO ENTRY" }
' map.csv data.csv
The first rule is qualified by NR==FNR (the overall record number equals the per-file record number -- i.e. records are being read from the first file). The second rule is only run on the second file and outputs the heading row unchanged before outputting the aggregated data.
Example Use/Output
You can simply select-copy and middle-mouse-paste the command above into an xterm with the current directory holding map.csv and data.csv which results in the following:
$ awk -F, '
> NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
> FNR==1 { print; next }
> { printf "%s,%s,%s\n", $1, $2, a[$2]?a[$2]:"NO ENTRY" }
> ' map.csv data.csv
contact,id,names
m#a.de,123, Hans Britta
ad#23.com,343, Siggi Horst
adf#er.org,123, Hans Britta
af#go.er,232, Peter
llk#fh.com,343, Siggi Horst
ad#wer.org,789,NO ENTRY
Alternative
An alternative that does exactly the same thing, but simplifies (slightly) by explicitly setting OFS="," before output begins, allowing the use of print instead of printf:
awk -F, '
NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
FNR==1 { OFS=","; print; next }
{ print $1, $2, a[$2]?a[$2]:"NO ENTRY" }
' map.csv data.csv
(same output)
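The leading space visible after the last comma (e.g. `123, Hans Britta`) comes from unconditionally prepending a space during concatenation. If the output must match the question's desired output byte-for-byte, a small variant (a sketch along the same lines, not the answer's exact code) adds the separator only when the entry is already non-empty:

```shell
#!/bin/sh
# Recreate a trimmed version of the question's samples.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'id,name\n123,Hans\n123,Britta\n232,Peter\n343,Siggi\n343,Horst\n' > map.csv
printf 'contact,id,names\nm#a.de,123,\naf#go.er,232,\nad#wer.org,789,\n' > data.csv

# Add the separator only when a[$1] already holds something, so the
# joined list carries no leading space.
out=$(awk -F, '
NR==FNR { if (FNR > 1) a[$1] = (a[$1] == "" ? $2 : a[$1] " " $2); next }
FNR==1  { print; next }
{ printf "%s,%s,%s\n", $1, $2, ($2 in a) ? a[$2] : "NO ENTRY" }
' map.csv data.csv)

printf '%s\n' "$out"
```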

Related

Getting all values of various rows which have the same value in one column with awk

I have a data set (test-file.csv) with three columns:
node,contact,mail
AAAA,Peter,peter#anything.com
BBBB,Hans,hans#anything.com
CCCC,Dieter,dieter#anything.com
ABABA,Peter,peter#anything.com
CCDDA,Hans,hans#anything.com
I would like to extend the header by a count column and rename node to nodes.
Furthermore, all entries should be sorted by the second column (mail).
In the count column I would like to get the number of occurrences of each value in the mail column;
in nodes, all entries having the same value in the mail column should be printed (space-separated and alphabetically sorted).
This is what I try to achieve:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
I have this awk-command:
awk -F"," '
BEGIN{
FS=OFS=",";
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR>1{
counts[$3]++; # Increment count of lines.
contact[$2]; # contact
}
END {
# Iterate over all third-column values.
for (x in counts) {
printf "%s,%s,%s,%s\n", contact[x],x,counts[x],"nodes"
}
}
' test-file.csv | sort --field-separator="," --key=2 -n
However, this is my result :-(
Nothing but the count of occurrences works.
,Dieter#anything.com,1,nodes
,hans#anything.com,2,nodes
,peter#anything.com,2,nodes
contact,mail,count,nodes
Any help appreciated!
You may use this GNU awk:
awk '
BEGIN {
FS = OFS = ","
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR > 1 {
++counts[$3] # Increment count of lines.
name[$3] = $2
map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
# Iterate over all third-column values.
PROCINFO["sorted_in"]="#ind_str_asc";
for (k in counts)
print name[k], k, counts[k], map[k]
}
' test-file.csv
Output:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2
mail[nf]=$NF
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,mail[i],arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="," } ##In BEGIN section setting FS, OFS as comma here.
FNR==1{ ##if this is first line then do following.
sub(/^[^,]*,/,"") ##Removing everything up to and including the 1st comma from the current line.
$1=$1 ##Reassigning 1st field to itself.
print $0,"count,nodes" ##Printing headers as per need to terminal.
}
FNR>1{ ##If line is Greater than 1st line then do following.
nf=$2 ##Creating nf with 2nd field value here.
mail[nf]=$NF ##Creating mail with nf as index and value is last field value.
NF-- ##Decreasing value of current number of fields by 1 here.
arr[nf]++ ##Creating arr with index of nf and keep increasing its value with 1 here.
val[nf]=(val[nf]?val[nf] " ":"")$1 ##Creating val with index of nf and keep adding $1 value in it.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr in here.
print i,mail[i],arr[i],val[i] | "sort -t, -k1" ##printing values to get expected output and sorting it also by pipe here as per requirement.
}
}
' Input_file ##Mentioning Input_file name here.
2nd solution: in case you want to sort by the 2nd and 3rd fields, try the following.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2 OFS $3
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file

How to get count of word occurrence at specific location in delimited file in Unix

I have a pipe-delimited file, and I want to check for the value 'America' at the 5th position (column) of each record.
The word America can appear in other columns as well, so grep -o is not giving the correct result.
Is there another way to check the occurrence of a word at a specific location when the file is delimited?
A script like this can do the work (note the == comparison; a single = would assign "America" to $5 and match every line):
awk '$5=="America" {print}' filename | wc -l
Of course you can do it with awk alone like this:
awk 'BEGIN {i=0} $5=="America" {i+=1} END {print i}' filename
To use a different delimiter, pass it to awk with -F:
awk -F'|' 'BEGIN {i=0} $5=="America" {i+=1} END {print i}' filename
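A self-contained check of the corrected command; the sample records here are invented for illustration (America at field 5 on two lines, elsewhere on another):

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1
# Hypothetical pipe-delimited sample: America is the 5th field on
# records 1 and 3, and appears in a different column on record 2.
cat > filename <<'EOF'
a|b|c|d|America|f
a|America|c|d|Europe|f
a|b|c|d|America|f
EOF

# Count records whose 5th pipe-delimited field equals America.
count=$(awk -F'|' '$5 == "America" { n++ } END { print n+0 }' filename)
echo "$count"
```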

How to convert heading of file content into column using awk?

I am trying to convert heading of file content into column using awk, below is my input file -
abc.txt
1234|43245
4325|65123
5432|12342
bcd.txt
865|432
324|543
123|654
cde.txt
12|321
21|123
32|123
output :
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
Explanation :
Fetch the filename (abc.txt) where NR==1, put it into an array or variable, print it with the file contents, and when the file contents are completed create a blank line.
I was trying to create two arrays, a for the NF==1 lines and b for the NF>1 lines, and then loop over b to merge the file contents with a, but I am still trying to figure out a solution.
In awk:
$ awk 'BEGIN{FS=OFS="|"} NF==1{h=$0} {print (NF==1?"": h OFS $0)}' file
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
The downside is that it prints an empty line at the beginning. If you can't live with that, add NR==1{next} before the printing block.
Explained:
BEGIN{ FS=OFS="|" } # set delimiters
NF==1{ h=$0 } # if NF==1 it's header time, store it to h
# NR==1{ next } # to remove the leading empty line, apply this
{ print (NF==1 ? "" : h OFS $0) } # print an empty record or the record with header
So here is my solution:
awk 'BEGIN{ OFS = "|"} /[a-z].[a-z]/{ if ($0 != header && NR > 1){print ""}; header = $0 }/[0-9]\|[0-9]/{ numbers = $0; print header, numbers }' yourfile
Output:
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
It uses no arrays, but it seems to work.
Use the following awk approach:
awk '{ if($1~/[a-z]+\.txt/) { if(NR != 1) {print ""} h=$1;next } print h"|"$1;}' testfile
The output:
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
Explanation:
if($1~/[a-z]+\.txt/) - the condition checks whether column $1 matches the pattern /[a-z]+\.txt/ (a header line)
h=$1;next - if a column matching the pattern is found, saves the header value (e.g. abc.txt) into variable h and skips the header line via next
if(NR != 1) {print ""} - prints a line break if it's not the first occurrence of a header line
print h"|"$1; - prints the header value, a separator, and each subsequent line
awk -F '|' '{if(NF==2)$0=F"|"$0;else{F=$1;$0=""}}NR>1' YourFile
Self-commented:
# use | as separator
awk -F '|' '
# for every line
{
# "data" lines have 2 fields
if(NF==2) {
# add the file name and "|" in front of the current line
$0 = F"|"$0
}
else {
# the file name is field 1
F=$1
# change the line to an empty line
$0=""
}
}
# print the line (in its new state, after the 1st line); printing is the default action
NR>1
' YourFile
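The approach can be verified against a trimmed version of the question's sample (note that it also emits a blank line between groups, matching the question's explanation):

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'abc.txt\n1234|43245\n4325|65123\nbcd.txt\n865|432\n' > YourFile

# Header lines have one field, data lines have two; remember the header
# in F and prefix it to each data line. NR>1 suppresses the first
# (now empty) header line.
out=$(awk -F '|' '{ if (NF==2) $0=F"|"$0; else { F=$1; $0="" } } NR>1' YourFile)
printf '%s\n' "$out"
```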

awk command to read inputs from two files if some fields are equal between the two files

How can I read input from the first file, say file1.txt, and print column 3 ($3) from file2.txt if $1 in the first file is equal to $2 in the second file?
if '$1 in file1.txt == $2 in file2.txt {print $3 from file2.txt}'
I couldn't find a simple and straightforward solution to this question.
It's pretty straightforward:
awk 'FNR == NR { a[FNR] = $1; next }
FNR != NR { if (a[FNR] == $2) print $3 }' file1.txt file2.txt
The first line saves the value of $1 for each line in file1.txt (and skips the rest of the script).
The second line doesn't formally need the FNR!=NR condition, but I think it makes it clearer. It processes file2.txt. If the value in $2 is equal to the corresponding saved value, print $3.
If the files are too big to save the $1 values from file1.txt in memory, you should have said so and you have to work harder. It can still be done with awk; it just isn't as neat and tidy and awk-ish.
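A minimal sketch of the answer with hypothetical sample files (the names file1.txt/file2.txt come from the question; the contents are invented so that lines 1 and 3 match on the key):

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1
# Hypothetical samples: file1.txt carries the keys in $1; file2.txt
# carries a key in $2 and the wanted value in $3.
printf 'A x\nB y\nC z\n' > file1.txt
printf '1 A foo\n2 X bar\n3 C baz\n' > file2.txt

# Save $1 of each file1.txt line under its line number, then print $3
# of any file2.txt line whose $2 equals the saved value for that line.
out=$(awk 'FNR == NR { a[FNR] = $1; next }
           FNR != NR { if (a[FNR] == $2) print $3 }' file1.txt file2.txt)
printf '%s\n' "$out"
```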

Match two files by column line by line - no key

I have two large files of 80,000-plus records that are identical in length. I need to compare the two files line by line on the first 8 characters of each line. Line one of file one is to be compared to line one of file two; line two of file one to line two of file two.
Sample file1
01234567blah blah1
11234567blah blah2
21234567blah blah3
31234567blah blah4
Sample file2
31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
Lines 2-4 should match but line 1 should not. My script matches line 1 of file2 against line 4 of file1, but it should be compared only to line 1 of file1.
awk '
FNR==NR {
a[substr($0,1,8)]=1;next
}
{if (a[substr($0,1,8)])print $0; else print "Not Found", $0;}
' $inputfile1 $inputfile2 > $outputfile1
Thank you.
For a line-by-line comparison you need to use the FNR variable as the key. Try:
awk 'NR==FNR{a[FNR]=substr($1,1,8);next}{print (a[FNR]==substr($1,1,8)?$0:"Not Found")}' file1 file2
Not Found
11234567matchme2
21234567matchme3
31234567matchme4
awk 'BEGIN{
while(1){
f=getline<"file1";
if(f!=1)exit;
a=substr($0,1,8);
f=getline<"file2";
if(f!=1)exit;
b=substr($0,1,8);
print a==b?$0:"Not Found"FS$0}}'
This reads one line from file1 and, if successful, stores the substring in a; then one line from file2 and, if successful, stores the substring in b; then it checks whether a and b are equal and prints the output.
Output:
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
If there's a single character that occurs in neither file, you can use it as a delimiter (: here) in a paste/awk combo:
paste -d: file1 file2 | awk -F: '{prefix=substr($1,1,8)!=substr($2,1,8) ? "Not Found"OFS : ""; print prefix $2}'
paste joins the corresponding lines from each file into one line, with a : separator
awk uses the : delimiter
awk tests for a match on the first 8 chars of each field and creates prefix
awk prints out every line, with a "Not Found" (plus OFS) prefix when they don't match.
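Run against the first two sample lines from the question (written to files named file1 and file2 here), the paste/awk combo behaves as described:

```shell
#!/bin/sh
dir=$(mktemp -d)
cd "$dir" || exit 1
# First two lines of each sample file from the question.
printf '01234567blah blah1\n11234567blah blah2\n' > file1
printf '31234567blah nomatch\n11234567matchme2\n' > file2

# paste glues line N of file1 to line N of file2 with ':'; awk then
# compares the first 8 characters of each half and prefixes mismatches.
out=$(paste -d: file1 file2 | awk -F: '
{ prefix = substr($1,1,8) != substr($2,1,8) ? "Not Found" OFS : ""
  print prefix $2 }')
printf '%s\n' "$out"
```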