AWK script to process one file and read another - arrays

I have written an AWK script to process a text file, and now need to extend it so the output from the processing takes data from another file, based on a field in the first file. Here is an example of what I mean;
File1.txt
abc123~17~yy~12345678
abc456~12~yy~23456789
abc789~34~zz~12345678
File2.txt
abc123~11~22~33~ABC-57
abc456~22~11~33~ABC-99
abc789~33~22~11~ABC-12
My current awk script extracts and processes each line from the File1.txt whose 4th field is '12345678', so it finds 2 lines.
I now want to extend this, so from the line I have found, say
abc123~xx~yy~12345678
we take the abc123, search for it in File2.txt, and print the 5th field of that line as well.
Eg.
My awk script will search for a token in field 4 of File1.txt, then print that along with field 1, and field 5 of File2.txt for the line whose field 1 matches field 1 from File1.txt.
So if we are searching for 12345678, my output would be
12345678 abc123 ABC-57 17
12345678 abc789 ABC-12 34
(The 17 and 34 have come from field 2 in File1.txt).
In summary then: search for a string in Field 4 of File1.txt, then find the line in File2.txt whose Field 1 matches Field 1 in File1.txt. Then print
File1.Field4 File1.Field1 File2.Field5 File1.Field2
I hope that is clear.
I tried to grep for the 'abc123' string in File2.txt and then select the 5th field. This did not seem to work, and now I think an AWK array of File2.txt that indexes on field 1 and stores field 5 might do it.
I am not sure how to go about this though.
(Note, this is a stripped-down example of what I want to do, my real requirement has more data in the files).

This one liner will do the trick:
$ awk -F'~' -v s='12345678' 'FNR==NR&&$4==s{a[$1];next}($1 in a){print s,$1,$5}' file1 file2
12345678 abc123 ABC-57
12345678 abc789 ABC-12
Explanation:
We set the field separator as ~ using the -F option and the value of the variable s to the string we want to match using the -v option.
As a script with some explanatory comments:
BEGIN { FS="~" } # Set the field separator.
FNR==NR && $4==s { # If we are in the first file and fourth field equals s
a[$1] # Create index of field one
next # Skip to next line
}
($1 in a) { # If field one in file2 is in index
print s,$1,$5 # Print s, field 1 and field 5
}
You would run this like awk -v s='12345678' -f script.awk file1 file2.

This looks to be the solution I wanted;
BEGIN { FS="~" } # Set the field separator.
FNR==NR && $4==s { # If we are in the first file and fourth field equals s
a[$1] # Create index of field one
field2[$1]=$2
next # Skip to next line
}
($1 in a) { # If field one in file2 is in index
print s,$1,$5,field2[$1] # Print s, field 1, field 5 and the stored field 2 from file1
}
I think that is correct.
My understanding of the solution is this. First it processes File1 in the first block of code, and I can store the data I want in arrays.
It then processes File 2 in the second block of code conditionally on $1 being in array a. If it is, then output the data, and access the field2 array from File 1.
Problem solved, and my real AWK script works a treat.
Many thanks for the help.
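For anyone wanting to try the accepted approach end to end, here is a minimal sketch using the sample data from the question. It is a variant, not the exact script above: it keeps only the field2 array (the separate a array is not needed once field2 is indexed by the same key) and runs next for every File1 line, so File1 lines can never fall through to the File2 rule. The temp directory and the out variable are just scaffolding for the demo.

```shell
# Build the two sample files from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/File1.txt" <<'EOF'
abc123~17~yy~12345678
abc456~12~yy~23456789
abc789~34~zz~12345678
EOF
cat > "$tmp/File2.txt" <<'EOF'
abc123~11~22~33~ABC-57
abc456~22~11~33~ABC-99
abc789~33~22~11~ABC-12
EOF

out=$(awk -F'~' -v s='12345678' '
# First file: remember field 2 for every line whose field 4 equals s.
FNR == NR { if ($4 == s) field2[$1] = $2; next }
# Second file: print only the lines whose key was remembered above.
$1 in field2 { print s, $1, $5, field2[$1] }
' "$tmp/File1.txt" "$tmp/File2.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample data this prints the two lines shown in the question (12345678 abc123 ABC-57 17 and 12345678 abc789 ABC-12 34).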

Related

What's the unix command to copy specific lines from one file to another file?

I searched the web for hours, please excuse me if I overlooked something. I'm a beginner. I want to copy lines that include a certain string from file1 to file2. These lines from file 1 have to be inserted in file2, but only in specific lines that include another string.
(It's about the entire lines with the timecode)
Content of file1:
1
00:00:16,520 --> 00:00:23,200
Some text
2
00:00:25,800 --> 00:00:32,600
Some more text
Content of file2:
1
00: 00: 16,520 -> 00: 00: 23,200
Different text
2
00: 00: 25,720 -> 00: 00: 32,520
More different text
awk '/ --> /' file1 lists the lines I need from file1. But what do I have to add to the code to take these awk results and copy them only into the lines of file2 that include '/ -> /'??
Thanks a lot for your support!!!
Result in file2 should be:
1
00:00:16,520 --> 00:00:23,200
Different text
2
00:00:25,800 --> 00:00:32,600
More different text
Note: below is for GNU awk
So you want to replace the timecode lines of the subtitles, right?
Given that they're identically indexed, i.e. the numbers above the timecodes are the same in both files, you can try this:
awk 'ARGIND==1 && /^[0-9]+$/{getline timeline; tl[$0]=timeline; next} ARGIND==2 && /^[0-9]+$/{getline tmp2drop; print $0 ORS tl[$0]; next} ARGIND==2' file1 file2
Note that /^[0-9]+$/ is the criterion: it matches a whole line that consists of a number only.
But if a line of subtitle text also consists of digits only, the replacement will go wrong.
Another way is to use the line number (FNR) as the index:
awk 'ARGIND==1 && /-->/{tl[FNR]=$0} ARGIND==2 {if (/->/) print tl[FNR]; else print $0} ' file1 file2
But if the line numbers are not the same between the two files, for example because some subtitle texts are multiline, it will still replace wrongly.
Given that the occurrences are in the same relative order, we can manage an index of our own:
awk 'ARGIND==1 && /-->/{tl[i++]=$0} ARGIND==2 {if (/->/) print tl[j++]; else print $0} ' file1 file2
None of these is perfect, but they should give you an idea of how you could do it.
Choose depending on your situation, and improve the code yourself :)
Note: these just print to the console. If you want to replace the file, you can use > or >> to redirect the output to a temp file, and later rename it to file2.
For example:
awk 'ARGIND==1 && /-->/{tl[i++]=$0} ARGIND==2 {if (/->/) print tl[j++]; else print $0} ' file1 file2 >> tmpFile2check
If you are not using GNU awk, ARGIND==1 won't work; use this instead:
awk 'NR==FNR && /-->/{tl[i++]=$0} NR>FNR {if (/->/) print tl[j++]; else print $0} ' file1 file2 >> tmpFile2check
NR is the number of records read so far across all files; FNR is the number of records read from the current file. If they are equal, the script is dealing with the first file. If NR>FNR, it is not the first file.
Note that if file1 is or could be empty, this mechanism fails (NR==FNR is then true while reading file2). In that case, switch to FILENAME=="file1" or another file check to avoid processing the wrong file.
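As a rough sketch of that FILENAME-based check for non-GNU awk: the version below compares FILENAME with ARGV[1] instead of hard-coding the name, which is my own choice and assumes the two files are passed under different names. The temp directory and the out variable are only demo scaffolding, and the sample data is taken from the question.

```shell
# Recreate the two subtitle samples from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 '00:00:16,520 --> 00:00:23,200' 'Some text' \
              2 '00:00:25,800 --> 00:00:32,600' 'Some more text' > "$tmp/file1"
printf '%s\n' 1 '00: 00: 16,520 -> 00: 00: 23,200' 'Different text' \
              2 '00: 00: 25,720 -> 00: 00: 32,520' 'More different text' > "$tmp/file2"

out=$(awk '
# Lines of the first named file: collect timecode lines in order, print nothing.
FILENAME == ARGV[1] { if (/-->/) tl[i++] = $0; next }
# Lines of the second file: swap each timecode line for the next stored one.
/->/ { print tl[j++]; next }
# Everything else in the second file is passed through unchanged.
{ print }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

If file1 is empty, no record is ever read while FILENAME equals ARGV[1], so file2 is not mistaken for the first file (its timecode lines just come out blank).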

edit a file in a specific column of a splited file bash shell

I have an issue where a user gives the file, column, value, and id of the line. I am trying to change the value of the line
The format of the file is:
F1|F2|F3|F4|F5|F6|F7|F8
My idea is to read the file and put the values of each field in an array. Then I will find the line I want to change using if, and use awk to change it.
while IFS=$'|t' read -r -a myArray
do
if [ $4 == ${myArray[0]} ]; then
echo "${myArray[1]} ${myArray[2]} ${myArray[4]}"
awk -v column="$5" -v value="$6"-F '{ ${myArray[column]} = value }'
echo "${myArray[1]} ${myArray[2]} ${myArray[4]}"
echo "${column} ${value}"
fi
done < $2
However, when I do that nothing changes: the column and value arguments don't print anything.
Any ideas?
You didn't give much information. Assuming you want to change a specific column on the lines where field 2 is F2, you can do it as below:
$2=="F2" checks whether field 2 matches your specific string.
$2="Hello" assigns "Hello" to field 2.
$1=$1 forces the whole record (line) to be rebuilt with OFS (the assignment to $2 already does this, so it is harmless here).
print prints out the whole record.
awk -F"|" 'BEGIN{OFS="|"} ($2=="F2"){$2="Hello";$1=$1; print}' sample.csv
See my example:
$ cat sample.csv
F1|F2|F3|F4|F5|F6|F7|F8
$ awk -F"|" 'BEGIN{OFS="|"} ($2=="F2"){$2="Hello";$1=$1; print}' sample.csv
F1|Hello|F3|F4|F5|F6|F7|F8
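Since the question has the user supply the id, column and value, here is a hedged sketch of passing those in with -v. It assumes the id lives in field 1 (that is what the question's loop compares against); the variable names id, col and val and the two-line sample file are made up for the demo.

```shell
# Small made-up sample in the same |-separated layout.
tmp=$(mktemp -d)
printf '%s\n' 'A1|F2|F3|F4' 'B2|G2|G3|G4' > "$tmp/sample.csv"

out=$(awk -F'|' -v OFS='|' -v id='B2' -v col=3 -v val='Hello' '
# On the line whose first field is the requested id, overwrite column col.
$1 == id { $col = val }
# Print every line, changed or not.
{ print }
' "$tmp/sample.csv")

printf '%s\n' "$out"
rm -rf "$tmp"
```

In a script you would replace the literals with the positional parameters, e.g. -v id="$4" -v col="$5" -v val="$6". One caveat: awk interprets backslash escapes in -v values, so a value containing backslashes needs extra care.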

Match two files by column line by line - no key

I have two large files of 80,000 plus records that are identical in length. I need to compare the two files line by line by the first 8 characters of the file. Line one of file one is to be compared to line one of file two. Line two of file one is to be compared to line two of file two.
Sample file1
01234567blah blah1
11234567blah blah2
21234567blah blah3
31234567blah blah4
Sample file2
31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
Lines 2 - 4 should match but line 1 should not. My script matches line 1 to line 4 but should be compared to just line 1.
awk '
FNR==NR {
a[substr($0,1,8)]=1;next
}
{if (a[substr($0,1,8)])print $0; else print "Not Found", $0;}
' $inputfile1 $inputfile2 > $outputfile1
Thank you.
For a line-by-line comparison you need to use the FNR variable as the key. Try:
awk 'NR==FNR{a[FNR]=substr($1,1,8);next}{print (a[FNR]==substr($1,1,8)?$0:"Not Found")}' file1 file2
Not Found
11234567matchme2
21234567matchme3
31234567matchme4
awk 'BEGIN{
while(1){
f=getline<"file1";
if(f!=1)exit;
a=substr($0,1,8);
f=getline<"file2";
if(f!=1)exit;
b=substr($0,1,8);
print a==b?$0:"Not Found"FS$0}}'
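This reads one line from file1 and, if successful, stores the first 8 characters in a; then it reads one line from file2 and, if successful, stores the first 8 characters in b. It then checks whether a and b are equal and prints the output accordingly. At that point $0 holds the file2 line, because the second getline overwrote it.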
Output:
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
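A possible variation on the same lockstep idea, sketched under the assumption that file2 is the file you want echoed: let awk's main loop read file2 and pull the matching line of file1 with getline into a variable, so $0 is never overwritten. The names f1 and ref and the temp directory are mine, not from the answer above.

```shell
# Recreate the two samples from the question.
tmp=$(mktemp -d)
printf '%s\n' '01234567blah blah1' '11234567blah blah2' \
              '21234567blah blah3' '31234567blah blah4' > "$tmp/file1"
printf '%s\n' '31234567blah nomatch' '11234567matchme2' \
              '21234567matchme3' '31234567matchme4' > "$tmp/file2"

out=$(awk -v f1="$tmp/file1" '
{
    # Fetch the same-numbered line of file1; stop if file1 runs out or cannot be read.
    if ((getline ref < f1) <= 0) exit 1
    # Compare the first 8 characters of the two lines.
    if (substr(ref, 1, 8) == substr($0, 1, 8)) print
    else print "Not Found", $0
}' "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Only one line of each file is held in memory at a time, which may matter for files much larger than the 80,000 records mentioned.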
If there is a single character that appears in neither file, you could use it as a delimiter (: works for your example) with a paste/awk combo like:
paste -d: data data2 | awk -F: '{prefix=substr($1,1,8)!=substr($2,1,8) ? "Not Found"OFS : ""; print prefix $2}'
paste joins the corresponding lines from each file into one line, with a : separator
awk uses the : delimiter
awk tests for a match on the first 8 chars of each field and creates prefix
awk prints out every line with a prefix that's "Not Found" (+OFS) when they don't match.

Using Bash array in AWK

I have two files as follows:
file1:
3 1
2 4
2 1
file2:
23
9
7
45
The second field of file1 is used to specify the line of file2 that contains the number to be retrieved and printed. In the desired output, the first field of file1 is printed and then the retrieved field is printed.
Desired output file:
3 23
2 45
2 23
Here is my attempt to solve this problem:
IFS=$'\r\n' baf2=($(cat file2));echo;awk -v av="${baf2[*]}" 'BEGIN {split(av, aaf2, / /)}{print $1, aaf2[$2]}' file1;echo;echo ${baf2[*]}
However, this script cannot use the Bash array baf2.
The solution must be efficient since file1 has billions of lines and file2 has millions of lines in the real case.
This has a similar basis to Jotne's solution, but loads file2 into memory first (since it is smaller than file1):
awk 'FNR==NR{x[FNR]=$0;next}{print $1 FS x[$2]}' file2 file1
Explanation
The FNR==NR part means that the part that follows in curly braces is only executed when reading file2, not file1. As each line of file2 is read, it is saved in array x[] as indexed by the current line number. The part in the second set of curly braces is executed for every line of file1 and it prints the first field on the line followed by the field separator (space) followed by the entry in x[] as indexed by the second field on the line.
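One thing to watch with billions of lookups: merely referencing x[$2] for a line number that file2 does not have creates an empty element in the array. A sketch that guards the lookup with in is below; the "NA" placeholder and the out-of-range sample line (2 9) are my own additions for illustration.

```shell
# file1 as in the question, except the last line points past the end of file2.
tmp=$(mktemp -d)
printf '%s\n' '3 1' '2 4' '2 9' > "$tmp/file1"
printf '%s\n' 23 9 7 45 > "$tmp/file2"

out=$(awk '
# Load file2 into memory, indexed by its line number.
FNR == NR { x[FNR] = $0; next }
# For file1, test membership first so missing keys do not grow the array.
{ print $1, (($2 in x) ? x[$2] : "NA") }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

If every value in field 2 of file1 is guaranteed to be a valid line number of file2, the guard is unnecessary and the shorter version above is fine.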
Using awk
1) print all lines of file1, whether or not there is a match
awk 'NR==FNR{a[NR]=$1;next}{print $1,a[$2]}' file2 file1
2) print matching lines only
awk 'NR==FNR{a[NR]=$1;next}$2=a[$2]' file2 file1
You can use this awk
awk 'FNR==NR {a[NR]=$1;next} {print $1,a[$2]}' file2 file1
3 23
2 45
2 23
Store file2 in array a.
Then print field 1 from file1 and use field 2 to look up the value in the array.

replacing lines of a text file with text of another file using sed or awk

I have a text file, e.g. File1.txt, and I want to replace a few of its lines with new lines from another text file, e.g. File2.txt. The format of File1.txt is shown below. It has START and END markers.
START
line 1
line 2
line 3
line 4
line 5
END
I want to replace line 1 through line 5 with the lines in File2.txt. The number of lines is not the same in File1.txt and File2.txt: File2.txt may have more or fewer lines than File1.txt.
I need input from someone. Thanks in anticipation.
If the parts of File1.txt that you want to preserve are fixed,
you only need to print the second file and include those parts:
printf 'START\n\n%s\n\nEND\n' "$(<File2.txt)"
If that's not the case (substitute START/END with the patterns
that match the parts that you want to preserve):
awk 'NR == FNR {
f2 = f2 ? f2 RS $0 : $0
next
}
/START|END/ || !NF {
print; next
}
NF && !c++ {
print f2
}' File2.txt File1.txt
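A different way to get a similar result, sketched with a skip flag rather than the counter used above: store File2.txt in an array, print it right after the START line, and suppress everything up to END. It assumes the markers sit alone on their lines and that File2.txt is not empty (otherwise NR == FNR would also be true for File1.txt). The sample contents are invented for the demo.

```shell
# Sample files: a START/END block with three old lines, and two new lines.
tmp=$(mktemp -d)
printf '%s\n' START 'line 1' 'line 2' 'line 3' END > "$tmp/File1.txt"
printf '%s\n' 'new A' 'new B' > "$tmp/File2.txt"

out=$(awk '
# First file on the command line (File2.txt): keep its lines in order.
NR == FNR { new[++n] = $0; next }
# At START: print the marker, then the replacement lines, and begin skipping.
/^START$/ { print; for (k = 1; k <= n; k++) print new[k]; skip = 1; next }
# At END: stop skipping so the marker itself is printed by the rule below.
/^END$/ { skip = 0 }
# Print any line that is not inside the replaced block.
!skip
' "$tmp/File2.txt" "$tmp/File1.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Text before START and after END is passed through untouched, and the number of lines in File2.txt does not have to match the block it replaces.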
This GNU sed one liner might work:
sed -re '/^START/,/^END/{/^START/{p;r File2.txt' -e '};/^END/p;d}' File1.txt
This inserts File2.txt between START and END but doesn't preserve empty lines after line 1 and before line 2
This tries to preserve empty lines:
sed -re '/^START/,/^END/{//!{/^$/{p;d};x;/./{x;d};x;h;r File2.txt' -e ';d};x;s/.*//;x}' File1.txt
