I am trying to convert the heading lines in a file's content into a column using awk. Below is my input file:
abc.txt
1234|43245
4325|65123
5432|12342
bcd.txt
865|432
324|543
123|654
cde.txt
12|321
21|123
32|123
Expected output:
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
Explanation:
Fetch the filename (abc.txt) where NR==1, put it into an array or a variable, and print it with the file contents; when one file's contents are complete, print a blank line.
I was trying to create two arrays: a for lines with NF==1 and b for lines with NF>1, then loop over b to merge the file contents with a, but I am still trying to figure out a solution.
In awk:
$ awk 'BEGIN{FS=OFS="|"} NF==1{h=$0} {print (NF==1?"": h OFS $0)}' file
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
The downside is that it prints an empty line at the beginning. If you can't live with that, add NR==1{next} before the printing block - or better yet, see @EdMorton's comment below.
Explained:
BEGIN{ FS=OFS="|" } # set delimiters
NF==1{ h=$0 } # if NF==1 it's header time, store it to h
# NR==1{ next } # to remove the leading enter, apply this
{ print (NF==1 ? "" : h OFS $0) } # print an empty record or the record with header
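For reference, here is that variant with the NR==1 guard applied (a sketch along the lines above, untested here):
awk 'BEGIN{FS=OFS="|"} NF==1{h=$0} NR==1{next} {print (NF==1 ? "" : h OFS $0)}' file
It behaves identically except that the leading empty line is suppressed; the blank lines between groups are kept.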
So here is my solution:
awk 'BEGIN{ OFS = "|" } /[a-z]\.[a-z]/{ if ($0 != header && NR > 1){ print "" }; header = $0 } /[0-9]\|[0-9]/{ numbers = $0; print header, numbers }' yourfile
Output:
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
It does not use any arrays, but it seems to work. (The dot in /[a-z]\.[a-z]/ is escaped so it matches a literal period rather than any character.)
Use the following awk approach:
awk '{ if($1~/[a-z]+\.txt/) { if(NR != 1) {print ""} h=$1;next } print h"|"$1;}' testfile
The output:
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342
bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654
cde.txt|12|321
cde.txt|21|123
cde.txt|32|123
Explanation:
if($1~/[a-z]+\.txt/) - the condition checks if the current column $1 matches the pattern /[a-z]+\.txt/ (a header line; the dot is escaped to match a literal period)
h=$1;next - if a column matching the pattern is found, saves the header value (e.g. abc.txt) into variable h and skips the header line via next
if(NR != 1) {print ""} - prints a line break if it's not the first occurrence of a header line
print h"|"$1; - prints the saved header value, a separator, and each subsequent data line
awk -F '|' '{if(NF==2)$0=F"|"$0;else{F=$1;$0=""}}NR>1' YourFile
Self-commented:
# use | as separator
awk -F '|' '
# for every lines
{
# lines with "data" have 2 fields
if(NF==2) {
# add the file name and "|" in front of the current line
$0 = F"|"$0
}
else {
# File name is field 1
F=$1
# change line to empty line
$0=""
}
}
# print the line (in its new state, after the 1st line); printing is the default action of a pattern
NR>1
' YourFile
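With the sample input, this should print (untested sketch; note the blank line between groups, which matches the question's description):
abc.txt|1234|43245
abc.txt|4325|65123
abc.txt|5432|12342

bcd.txt|865|432
bcd.txt|324|543
bcd.txt|123|654

cde.txt|12|321
cde.txt|21|123
cde.txt|32|123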
I have a data set (test-file.csv) with three columns:
node,contact,mail
AAAA,Peter,peter#anything.com
BBBB,Hans,hans#anything.com
CCCC,Dieter,dieter#anything.com
ABABA,Peter,peter#anything.com
CCDDA,Hans,hans#anything.com
I'd like to extend the header with a count column and rename node to nodes.
Furthermore, all entries should be sorted by the second column of the output (mail).
In the count column I'd like to get the number of occurrences of each mail value;
in nodes, all the entries sharing the same value in the mail column should be printed (space-separated and alphabetically sorted).
This is what I try to achieve:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
I have this awk-command:
awk -F"," '
BEGIN{
FS=OFS=",";
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR>1{
counts[$3]++; # Increment count of lines.
contact[$2]; # contact
}
END {
# Iterate over all third-column values.
for (x in counts) {
printf "%s,%s,%s,%s\n", contact[x],x,counts[x],"nodes"
}
}
' test-file.csv | sort --field-separator="," --key=2 -n
However, this is my result :-(
Nothing but the count of occurrences works.
,Dieter#anything.com,1,nodes
,hans#anything.com,2,nodes
,peter#anything.com,2,nodes
contact,mail,count,nodes
Any help appreciated!
You may use this GNU awk:
awk '
BEGIN {
FS = OFS = ","
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR > 1 {
++counts[$3] # Increment count of lines.
name[$3] = $2
map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
# Iterate over all third-column values.
PROCINFO["sorted_in"]="#ind_str_asc";
for (k in counts)
print name[k], k, counts[k], map[k]
}
' test-file.csv
Output:
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA
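If your awk is not GNU awk (PROCINFO["sorted_in"] is a gawk extension), a hedged alternative is to pipe the END output through an external sort, much like the next answer does (a sketch, untested here):
awk '
BEGIN {
    FS = OFS = ","
    print "contact,mail,count,nodes"     # header goes straight to stdout
}
NR > 1 {
    ++counts[$3]
    name[$3] = $2
    map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
    for (k in counts)                    # sort the data rows by the mail field
        print name[k], k, counts[k], map[k] | "sort -t, -k2,2"
}
' test-file.csv
The header appears before the sorted block because the pipe to sort is only flushed when awk exits.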
With your shown samples, please try the following. Written and tested in GNU awk.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2
mail[nf]=$NF
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,mail[i],arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="," } ##In BEGIN section setting FS, OFS as comma here.
FNR==1{ ##if this is first line then do following.
sub(/^[^,]*,/,"") ##Substituting everything up to and including the 1st comma with the empty string in the current line.
$1=$1 ##Reassigning 1st field to itself.
print $0,"count,nodes" ##Printing headers as per need to terminal.
}
FNR>1{ ##If line is Greater than 1st line then do following.
nf=$2 ##Creating nf with 2nd field value here.
mail[nf]=$NF ##Creating mail with nf as index and value is last field value.
NF-- ##Decreasing value of current number of fields by 1 here.
arr[nf]++ ##Creating arr with index of nf and keep increasing its value with 1 here.
val[nf]=(val[nf]?val[nf] " ":"")$1 ##Creating val with index of nf and keep adding $1 value in it.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr in here.
print i,mail[i],arr[i],val[i] | "sort -t, -k1" ##printing values to get expected output and sorting it also by pipe here as per requirement.
}
}
' Input_file ##Mentioning Input_file name here.
2nd solution: in case you want to sort by the 2nd and 3rd fields, try the following.
awk '
BEGIN{ FS=OFS="," }
FNR==1{
sub(/^[^,]*,/,"")
$1=$1
print $0,"count,nodes"
}
FNR>1{
nf=$2 OFS $3
NF--
arr[nf]++
val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
for(i in arr){
print i,arr[i],val[i] | "sort -t, -k1"
}
}
' Input_file
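With the shown sample, this 2nd solution should print (untested sketch):
contact,mail,count,nodes
Dieter,dieter#anything.com,1,CCCC
Hans,hans#anything.com,2,BBBB CCDDA
Peter,peter#anything.com,2,AAAA ABABA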
I have two files I need to compare, mapping a value for which multiple rows can match.
My mapping file (map.csv) looks like:
id,name
123,Hans
123,Britta
232,Peter
343,Siggi
343,Horst
The data file (data.csv) is
contact,id,names
m#a.de,123,
ad#23.com,343,
adf#er.org,123,
af#go.er,232,
llk#fh.com,343,
ad#wer.org,789,
The desired output should look like this:
contact,id,names
m#a.de,123,Hans Britta
ad#23.com,343,Siggi Horst
adf#er.org,123,Hans Britta
af#go.er,232,Peter
llk#fh.com,343,Siggi Horst
ad#wer.org,789,NO ENTRY
There are multiple values for one id in the mapping file, and they should be printed space-separated into the names column of the data file. If an id is not in the mapping file, "NO ENTRY" should be printed instead.
This is my awk command:
awk 'NR==FNR{a[$1];next}{print $0,($2 in a)? a[$2]:"NO ENTRY"}' map.csv data.csv
I clearly fail because I do not know how to collect multiple values for one id from the mapping file (or, currently, any value at all).
With your shown samples, please try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
arr[$1]=(arr[$1]?arr[$1] " ":"")$2
next
}
FNR==1{
print
next
}
{
sub(/,$/,"")
print $0,($2 in arr)?arr[$2]:"NO ENTRY"
}
' map.csv data.csv
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS and OFS as comma here.
}
FNR==NR{ ##Checking condition which will be TRUE when map.csv is being read.
arr[$1]=(arr[$1]?arr[$1] " ":"")$2 ##Creating arr with index $1 and value $2, concatenating values that share the same index.
next ##next will skip all further statements from here.
}
FNR==1{ ##Checking condition if this is first line of data.csv then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
{
sub(/,$/,"") ##Substituting last comma with NULL here.
print $0,($2 in arr)?arr[$2]:"NO ENTRY" ##Printing current line and printing either value of arr with index of $2 OR printing NO ENTRY as per requirement.
}
' map.csv data.csv ##Mentioning Input_file names here.
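With the shown samples, this should reproduce the desired output exactly (untested sketch):
contact,id,names
m#a.de,123,Hans Britta
ad#23.com,343,Siggi Horst
adf#er.org,123,Hans Britta
af#go.er,232,Peter
llk#fh.com,343,Siggi Horst
ad#wer.org,789,NO ENTRY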
You can use two rules in your case. One to capture the data from map.csv and then a second rule to output the results, e.g.
(edit -- updated to match 1st row of output exactly)
awk -F, '
NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
FNR==1 { print; next }
{ printf "%s,%s,%s\n", $1, $2, a[$2]?a[$2]:"NO ENTRY" }
' map.csv data.csv
The first rule is qualified by NR==FNR (the overall record number equals the current file's record number -- i.e. we are in the first file). The second rule is only run on the second file and outputs the heading row unchanged before outputting the aggregated data.
Example Use/Output
You can simply select-copy and middle-mouse-paste the command above into an xterm with the current directory holding map.csv and data.csv, which results in the following:
$ awk -F, '
> NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
> FNR==1 { print; next }
> { printf "%s,%s,%s\n", $1, $2, a[$2]?a[$2]:"NO ENTRY" }
> ' map.csv data.csv
contact,id,names
m#a.de,123, Hans Britta
ad#23.com,343, Siggi Horst
adf#er.org,123, Hans Britta
af#go.er,232, Peter
llk#fh.com,343, Siggi Horst
ad#wer.org,789,NO ENTRY
Alternative
An alternative that does the exact same thing but simplifies (slightly) by explicitly setting OFS="," before output begins, allowing the use of print instead of printf:
awk -F, '
NR==FNR { if (FNR > 1) a[$1]=a[$1]" "$2; next }
FNR==1 { OFS=","; print; next }
{ print $1, $2, a[$2]?a[$2]:"NO ENTRY" }
' map.csv data.csv
(same output)
I have a results.csv file that contains names in the following layout:
name1, 2(random number)
name5, 3
and a sample.txt, that is structured in the following
record_seperator
name1
foo
bar
record_seperator
name2
bla
bluh
I would like to search for each name from results.csv in the sample.txt file and, if it is found, output the record into a file.
I tried to generate an array out of the first file and search for that, but I couldn't get the syntax right.
It needs to run in a bash script. If anyone has a better idea than awk, that is also fine, but I do not have admin rights on the machine it is supposed to run on.
The real csv file contains 10,000 names and the sample.txt 4.5 million records.
I am a bloody beginner in awk, so explanation would be much appreciated.
This is my current try, which does not work and I don't know why:
#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
split($0,name,",");
nameArr[k]=name[1];
}
{
RS="record_seperator"
FS="\n"
for (key in nameArr)
{
print nameArr[key]
print $2
if ($2==nameArr[key])
NR > 1
{
#extract file by Record separator and name from line2
print RS $0 > $2 ".txt"
}
}
}
}' sample.txt
edit:
My expected output would be two files:
name1.txt
record_seperator
name1
foo
bar
name2.txt
record_seperator
name2
bla
bluh
Here's one. As there was no expected output, it just outputs raw records:
$ awk '
NR==FNR { # process first file
a[$1]=RS $0 # hash the whole record with first field (name) as key
next # process next record in the first file
} # after this line second file processing
$1 in a { # if first field value (name) is found in hash a
f=$1 ".txt" # generate filename
print a[$1] > f # output the whole record
close(f) # preserving fds
}' RS="record_seperator\n" sample RS="\n" FS="," results # file order and related vars
Only one match:
$ cat name1.txt
record_seperator
name1
foo
bar
Tested on gawk and mawk; it acts weird on original-awk (a multi-character RS is an extension and is not supported by POSIX awk).
Something like this (not tested):
$ awk -F, 'NR==FNR {a[$1]; next} # fill array with names from first file
$1 in a {print rt, $0 > ($1".txt")} # print the record from second file
{rt = RT}' results.csv RS="define_it_here" sample.txt
Since your record separator comes before the records, you need to delay it by one record; that is what the rt variable does.
Use the built-in line/record iterator instead of working around it.
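For instance, with the separator filled in for the sample shown, a concrete sketch could look like this (assuming GNU awk, whose RT variable and multi-character RS this relies on; untested here):
awk '
NR==FNR { a[$1]; next }              # remember each name from results.csv
$1 in a {                            # record whose first line is a wanted name
    f = $1 ".txt"
    printf "%s%s", rt, $0 > f        # prepend the delayed separator
    close(f)                         # keep the number of open files low
}
{ rt = RT }                          # delay each record terminator by one record
' FS="," results.csv FS="\n" RS="record_seperator\n" sample.txt
Note the FS switch between the two files: comma for results.csv, newline for sample.txt, so $1 is the name in both.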
Your code's errors:
#!/bin/bash
awk 'BEGIN{
while (getline < "results.csv")
{
split($0,name,",");
nameArr[k]=name[1]; ## <-- k does not exist, so you are rewriting nameArr[""] again and again.
}
{
RS="record_seperator"
FS="\n"
for (key in nameArr) ## <-- only one key ("") exists; it will never equal $2
{
print nameArr[key]
print $2
if ($2==nameArr[key])
NR > 1
{
#extract file by Record separator and name from line2
print RS $0 > $2 ".txt"
}
}
}
}' sample.txt
Also the sample you showed:
name1, 2(random number)
name5, 3 ## <-- name5 here, not name2 !
I changed name5 to name2, and here is your own code updated:
#!/bin/bash
awk 'BEGIN{
while ( (getline line< "results.csv") > 0 ) { # Avoid an infinite loop when a read error is encountered.
split(line,name,",");
nameArr[name[1]]; # No need to assign anything; referring to it once establishes the key (name[1]).
}
RS="record_seperator";
FS="\n";
}
$2 in nameArr {
print RS $0; #You can add `> $2 ".txt"` later yourself.
}' sample.txt
Output:
record_seperator
name1
foo
bar
record_seperator
name2
bla
bluh
(Following @Tiw's lead, I also changed name5 to name2 in your results file in order to get the expected output.)
$ cat a.awk
# collect the result names into an array
NR == FNR {a[$1]; next}
# skip the first (empty) sample record caused by initial record separator
FNR == 1 { next }
# If found, output sample record into the appropriate file
$1 in a {
f = ($1 ".txt")
printf "record_seperator\n%s", $0 > f
}
Run with gawk for multi-character RS:
$ gawk -f a.awk FS="," results.csv FS="\n" RS="record_seperator\n" sample.txt
Check results:
$ cat name1.txt
record_seperator
name1
foo
bar
$ cat name2.txt
record_seperator
name2
bla
bluh
I searched the web for hours; please excuse me if I overlooked something. I'm a beginner. I want to copy lines that include a certain string from file1 to file2. These lines from file1 have to be inserted into file2, but only at the specific lines that include another string.
(It's about the entire lines with the timecode)
Content of file1:
1
00:00:16,520 --> 00:00:23,200
Some text
2
00:00:25,800 --> 00:00:32,600
Some more text
Content of file2:
1
00: 00: 16,520 -> 00: 00: 23,200
Different text
2
00: 00: 25,720 -> 00: 00: 32,520
More different text
awk '/ --> /' file1 lists the lines I need from file1. But what do I have to add to the code to take these awk results and copy them only into the lines of file2 that include ' -> '?
Thanks a lot for your support!!!
Result in file2 should be:
1
00:00:16,520 --> 00:00:23,200
Different text
2
00:00:25,800 --> 00:00:32,600
More different text
Note: below is for GNU awk
So you want to replace the timecodes of subtitles, right?
Given that they're identically indexed, i.e. the numbers above the timecodes are the same.
Then you can try this:
awk 'ARGIND==1 && /^[0-9]+$/{getline timeline; tl[$0]=timeline;}ARGIND==2 &&/^[0-9]+$/{getline tmp2drop; print $0 ORS tl[$0];} ' file1 file2
Note that /^[0-9]+$/ is the criterion, which matches a line consisting of a number only.
But if a subtitle text line happens to look like that too, it will lead to a wrong replacement.
Another way is to use the line number (denoted FNR) as the index:
awk 'ARGIND==1 && /-->/{tl[FNR]=$0} ARGIND==2 {if (/->/) print tl[FNR]; else print $0} ' file1 file2
But if the line numbers are not the same between the two files, for example when some subtitle texts are multi-line, it will still replace wrongly.
Given that the occurrences are in relatively the same places, we can manage an index on our own:
awk 'ARGIND==1 && /-->/{tl[i++]=$0} ARGIND==2 {if (/->/) print tl[j++]; else print $0} ' file1 file2
None of these is perfect, but they give you an idea of how you could do it.
Choose depending on your situation, and improve the code yourself :)
Note: these just print to the console. If you want to replace the file, you can use > or >> to print the output to a temp file, and later rename it to file2.
For example:
awk 'ARGIND==1 && /-->/{tl[i++]=$0} ARGIND==2 {if (/->/) print tl[j++]; else print $0} ' file1 file2 >> tmpFile2check
If you are not using GNU awk, ARGIND==1 won't work; use this instead:
awk 'NR==FNR && /-->/{tl[i++]=$0} NR>FNR {if (/->/) print tl[j++]; else print $0} ' file1 file2 >> tmpFile2check
NR is the total Number of Records read so far; FNR is the current File's Number of Records. If they are equal, the script is dealing with the first file; if NR>FNR, it's not the first file.
Note that if file1 is or could be empty, this mechanism fails; in that case, switch to FILENAME=="file1" or another file-checking method to avoid wrong processing, as sketched below.
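For example, a sketch of that FILENAME-based variant (it assumes the files are literally named file1 and file2):
awk '
FILENAME == "file1" && /-->/ { tl[i++] = $0 }                  # collect timecode lines from file1
FILENAME == "file2" { if (/->/) print tl[j++]; else print $0 } # replace matching lines in file2
' file1 file2 >> tmpFile2check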
I have two large files of 80,000-plus records that are identical in length. I need to compare the two files line by line on the first 8 characters of each line. Line one of file one is to be compared to line one of file two; line two of file one to line two of file two.
Sample file1
01234567blah blah1
11234567blah blah2
21234567blah blah3
31234567blah blah4
Sample file2
31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
Lines 2-4 should match, but line 1 should not. My script matches line 1 of file2 against line 4 of file1, but it should be compared to line 1 only.
awk '
FNR==NR {
a[substr($0,1,8)]=1;next
}
{if (a[substr($0,1,8)])print $0; else print "Not Found", $0;}
' $inputfile1 $inputfile2 > $outputfile1
Thank you.
For a line-by-line compare you need to use the FNR variable as the key. Try:
awk 'NR==FNR{a[FNR]=substr($1,1,8);next}{print (a[FNR]==substr($1,1,8)?$0:"Not Found")}' file1 file2
Output:
Not Found
11234567matchme2
21234567matchme3
31234567matchme4
awk 'BEGIN{
    while (1) {
        f = getline < "file1"
        if (f != 1) exit
        a = substr($0, 1, 8)
        f = getline < "file2"
        if (f != 1) exit
        b = substr($0, 1, 8)
        print (a == b ? $0 : "Not Found" FS $0)
    }
}'
Reads one line from file1 and, if successful, stores its first 8 characters in a; then reads one line from file2 and, if successful, stores its first 8 characters in b; finally it checks whether a and b are equal and prints accordingly ($0 holds the file2 line at that point).
Output:
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
If there's a single character that appears in neither file, you could use it as a delimiter, for example :, with a paste/awk combo like:
paste -d: file1 file2 | awk -F: '{prefix=substr($1,1,8)!=substr($2,1,8) ? "Not Found"OFS : ""; print prefix $2}'
paste joins the corresponding lines from each file into one line, with a : separator
awk uses the : delimiter
awk tests for a match on the first 8 chars of each field and creates prefix
awk prints every file2 line, prefixed with "Not Found" (plus OFS) when the first 8 characters don't match.
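With the sample files, this should print (untested sketch):
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4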