Split file in Unix based on occurrence of some specific string

The contents of my file are as follows:
Tenor|CurrentCoupon
15Y|3.091731898890382
30Y|3.5773546584901617
Id|Cusip|Ticker|Status|Error|AsOfDate|Price|LiborOas
1|01F020430|FN 15 2 F0|1||20180312|95.19140625|-0.551161358515
2|01F020448|FN 15 2 F1|1||20180312|95.06640625|1.18958768351
3|01F020547|FN 20 2 F0|1||20180312|90.484375|50.742896921
4|01F020554|FN 20 2 F1|1||20180312|90.359375|52.4642397071
5|01F020646|FN 30 2 F0|1||20180312|90.25|6.26649840403
and I have to split it into two files like:
Tenor,CurrentCoupon
15Y,3.294202313
30Y,3.727696014
and
Id,Cusip,Ticker,Status,Error,AsOfDate,Price,LiborOas
1,01F020489,FN 15 2 F0,1,,20180807,94.27734375,6.199343069
2,01F020497,FN 15 2 F1,1,,20180807,94.15234375,8.225144379
3,01F020588,FN 20 2 F0,1,,20180807,89.984375,48.11248894
I have very little knowledge of UNIX scripts. The number of rows will vary.

Using awk, you can do something very simple:
awk -F '|' '{print $0 > NF ".txt"}' yourfile.txt
This command will split your file into 2.txt (all rows containing 2 columns) and 8.txt (all rows containing 8 columns).
To understand this command: the -F option sets the input field separator, awk parses your file line by line, $0 stands for the entire row, and NF for the number of fields in the parsed row.
If you want to change the delimiter from | to , :
awk -F '|' 'BEGIN{OFS=","};{$1=$1; print > NF ".txt"}' yourfile.txt
OFS stands for Output Field Separator; $1=$1 is an ugly but effective hack to force awk to rebuild the row with the new separator.
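To sanity-check the result, assuming the input above is saved as yourfile.txt, the two-column rows land in 2.txt:
$ awk -F '|' 'BEGIN{OFS=","};{$1=$1; print > NF ".txt"}' yourfile.txt
$ cat 2.txt
Tenor,CurrentCoupon
15Y,3.091731898890382
30Y,3.5773546584901617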

Related

How do I split a text file into an array by blank lines?

I have a bash command that outputs text in the following format:
Header 1
- Point 1
- Point 2
Header 2
- Point 1
- Point 2
Header 3
-Point 1
- Point 2
...
I want to parse this text into an array, separating on the empty line so that array[0] for example contains:
Header 1
- Point 1
- Point 2
And then I want to edit some of the data in the array if it satisfies certain conditions.
I was looking at something like this: Separate by blank lines in bash, but I'm completely new to bash, so I don't understand how to save the output from awk with an empty RS to an array instead of printing it out. Could someone please point me in the right direction?
You can use the readarray command (bash 4.4+ for its -d option) to populate a bash array after filtering your file through a GNU awk command with an empty RS, which lets awk split records on empty lines, and with ORS set to a \0 (NUL) byte:
IFS= readarray -d '' arr < <(awk -v RS= -v ORS='\0' '1' file)
Check output:
echo "${arr[0]}"
Header 1
- Point 1
- Point 2
echo "${arr[1]}"
Header 2
- Point 1
- Point 2
echo "${arr[2]}"
Header 3
-Point 1
- Point 2
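From there you can edit entries in place. A minimal sketch, with a made-up condition and appended text purely for illustration:
for i in "${!arr[@]}"; do
  if [[ ${arr[i]} == *"Header 2"* ]]; then   # hypothetical condition
    arr[i]+=$'\n- Point 3'                   # append a line to that block
  fi
done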

Pick 20 records each time and transpose from a big file

I have a big file with 1 column and 800,000 rows
Example:
123
234
...
5677
222
444
I want to transpose it into 20 numbers per line.
Example:
123,234,....
5677,
222,
444,....
I tried using a while loop like this:
while [ $(wc -l < list.dat) -ge 1 ]
do
cat list.dat | head -20 | awk -vORS=, '{ print $1 }'| sed 's/,$/\n/' >> sample1.dat
sed -i -e '1,20d' list.dat
done
but this is insanely slow.
Can anyone suggest a faster solution?
pr is the right tool for this, for example:
$ seq 100 | pr -20ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
For your file, try pr -20ats, list.dat
Depending on the width of the column text, you might run into the error pr: page width too narrow. In that case, try:
$ seq 100000 100100 | pr -40ats,
pr: page width too narrow
$ seq 100000 100100 | pr -J -W79 -40ats,
100000,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,100011,100012,100013,100014,100015,100016,100017,100018,100019,100020,100021,100022,100023,100024,100025,100026,100027,100028,100029,100030,100031,100032,100033,100034,100035,100036,100037,100038,100039
100040,100041,100042,100043,100044,100045,100046,100047,100048,100049,100050,100051,100052,100053,100054,100055,100056,100057,100058,100059,100060,100061,100062,100063,100064,100065,100066,100067,100068,100069,100070,100071,100072,100073,100074,100075,100076,100077,100078,100079
100080,100081,100082,100083,100084,100085,100086,100087,100088,100089,100090,100091,100092,100093,100094,100095,100096,100097,100098,100099,100100
The formula for the -W value is (col-1)*len(delimiter) + col, where col is the number of columns required; with 40 columns and the one-character delimiter above, that is (40-1)*1 + 40 = 79, hence -W79.
From man pr
pr - convert text files for printing
-a, --across
print columns across rather than down, used together with -COLUMN
-t, --omit-header
omit page headers and trailers; implied if PAGE_LENGTH <= 10
-s[CHAR], --separator[=CHAR]
separate columns by a single character, default for CHAR is the <TAB> character without -w and 'no char' with -w. -s[CHAR] turns off line truncation of all 3 column options (-COLUMN|-a -COLUMN|-m) except -w is set
-COLUMN, --columns=COLUMN
output COLUMN columns and print columns down, unless -a is used. Balance number of lines in the columns
on each page
-J, --join-lines
merge full lines, turns off -W line truncation, no column alignment, --sep-string[=STRING] sets separators
-W, --page-width=PAGE_WIDTH
set page width to PAGE_WIDTH (72) characters always, truncate lines, except -J option is set, no interference with -S or -s
See also Why is using a shell loop to process text considered bad practice?
If you don't wish to use any other external binaries, you can refer to the SO link below, which answers a similar question in depth.
bash: combine five lines of input to each line of output
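In that spirit, a minimal pure-bash sketch (no external binaries at all; correct, though shell loops like this will be slow on 800,000 lines):
i=0; buf=
while IFS= read -r line; do
  buf+=${buf:+,}$line                        # join values with commas
  if (( ++i % 20 == 0 )); then printf '%s\n' "$buf"; buf=; fi
done < list.dat
[[ -n $buf ]] && printf '%s\n' "$buf"        # flush a final partial group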
If you want to use sed:
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H;' list.dat
The first command,
21~20 { x; s/^\n//; s/\n/, /g; p;}
is triggered at lines matching 21+(n*20), n>=0 (the first~step address form is a GNU sed extension). Here everything that was put into the hold space at the complementary lines via the second command,
21~20! H;
is processed:
x;
puts the content of the hold buffer (20 lines) in the pattern space and places the current line (21+(n*20)) in the hold buffer. In the pattern space:
s/^\n//
removes the leading newline (H appends a newline before each added line) and:
s/\n/, /g
does the desired substitution. Then:
p;
prints the now 20-columned row.
After that, the current line (which x placed in the hold buffer) begins the next block and the process continues. Note that a final group of fewer than 20 lines remains in the hold space and is never printed.
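A quick check with GNU sed, using seq 45 as input (the ", " separator comes from the s/\n/, /g substitution; lines 41-45 form a partial group and are dropped):
$ seq 45 | sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H;'
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40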

Match two files by column line by line - no key

I have two large files of 80,000-plus records that are identical in length. I need to compare the two files line by line on the first 8 characters of each line: line one of file one is compared to line one of file two, line two of file one to line two of file two, and so on.
Sample file1
01234567blah blah1
11234567blah blah2
21234567blah blah3
31234567blah blah4
Sample file2
31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
Lines 2-4 should match but line 1 should not. My script matches line 1 of file2 against line 4 of file1, whereas it should be compared to line 1 of file1 only.
awk '
FNR==NR {
a[substr($0,1,8)]=1;next
}
{if (a[substr($0,1,8)])print $0; else print "Not Found", $0;}
' $inputfile1 $inputfile2 > $outputfile1
Thank you.
For a line-by-line compare you need to use the FNR variable as the key. Try:
awk 'NR==FNR{a[FNR]=substr($1,1,8);next}{print (a[FNR]==substr($1,1,8)?$0:"Not Found")}' file1 file2
Not Found
11234567matchme2
21234567matchme3
31234567matchme4
awk 'BEGIN{
  while(1){
    f=getline<"file1";
    if(f!=1)exit;
    a=substr($0,1,8);
    f=getline<"file2";
    if(f!=1)exit;
    b=substr($0,1,8);
    print (a==b ? $0 : "Not Found" FS $0)
  }
}'
This reads one line from file1 and, if successful, stores its first eight characters in a; it then reads one line from file2 and, if successful, stores its first eight characters in b. It then checks whether a and b are equal and prints the output accordingly.
Output:
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4
If there's a single character that occurs in neither file, you could use it as a delimiter (: works for your example) and a paste/awk combo like:
paste -d: data data2 | awk -F: '{prefix=substr($1,1,8)!=substr($2,1,8) ? "Not Found"OFS : ""; print prefix $2}'
paste joins the corresponding lines from each file into one line, with a : separator
awk uses the : delimiter to split that joined line into two fields
awk tests for a match on the first 8 chars of each field and builds prefix
awk prints every line, prefixed with "Not Found" (plus OFS) when the two sides don't match.
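Run against the sample files, here assumed to be saved as file1 and file2, this gives:
$ paste -d: file1 file2 | awk -F: '{prefix=substr($1,1,8)!=substr($2,1,8) ? "Not Found"OFS : ""; print prefix $2}'
Not Found 31234567blah nomatch
11234567matchme2
21234567matchme3
31234567matchme4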

Using Bash array in AWK

I have two files as follows:
file1:
3 1
2 4
2 1
file2:
23
9
7
45
The second field of file1 is used to specify the line of file2 that contains the number to be retrieved and printed. In the desired output, the first field of file1 is printed and then the retrieved field is printed.
Desired output file:
3 23
2 45
2 23
Here is my attempt to solve this problem:
IFS=$'\r\n' baf2=($(cat file2));echo;awk -v av="${baf2[*]}" 'BEGIN {split(av, aaf2, / /)}{print $1, aaf2[$2]}' file1;echo;echo ${baf2[*]}
However, this script fails to make the Bash array baf2 usable inside awk.
The solution must be efficient since file1 has billions of lines and file2 has millions of lines in the real case.
This has a similar basis to Jotne's solution, but loads file2 into memory first (since it is smaller than file1):
awk 'FNR==NR{x[FNR]=$0;next}{print $1 FS x[$2]}' file2 file1
Explanation
The FNR==NR part means that the part that follows in curly braces is only executed when reading file2, not file1. As each line of file2 is read, it is saved in array x[] as indexed by the current line number. The part in the second set of curly braces is executed for every line of file1 and it prints the first field on the line followed by the field separator (space) followed by the entry in x[] as indexed by the second field on the line.
Using awk
1) Print all lines of file1, whether or not there is a match:
awk 'NR==FNR{a[NR]=$1;next}{print $1,a[$2]}' file2 file1
2) Print matching lines only:
awk 'NR==FNR{a[NR]=$1;next}$2=a[$2]' file2 file1
You can use this awk
awk 'FNR==NR {a[NR]=$1;next} {print $1,a[$2]}' file2 file1
3 23
2 45
2 23
This stores file2 in array a, then prints field 1 from file1 and uses field 2 as the index to look up the value in the array.
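And if you really do want to feed an existing Bash array into awk, as in the original attempt, one option is process substitution instead of -v; a sketch reusing the same idiom (the array contents here simply mirror file2):
baf2=(23 9 7 45)
awk 'NR==FNR{a[FNR]=$1;next}{print $1, a[$2]}' <(printf '%s\n' "${baf2[@]}") file1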

Using awk to store values from column reads for multiple files

I am using cygwin on Windows 7. I have a directory of text files, and I want to loop through it and, for each file, save the data from the second column of the first three rows: positions (1,2), (2,2), and (3,2).
So, the code would be something like
x1[0]=$(awk 'FNR == 1 {print $2}' "$file1")
x1[1]=$(awk 'FNR == 2 {print $2}' "$file1")
x1[2]=$(awk 'FNR == 3 {print $2}' "$file1")
Then I want to use each value of x1 divided by 100, plus 1, as a line number for reading data from another file. So that's:
x1[0]=$(( x1[0] / 100 + 1 ))
x1[1]=$(( x1[1] / 100 + 1 ))
x1[2]=$(( x1[2] / 100 + 1 ))
read1=$(awk -v n="${x1[0]}" 'FNR == n {print $1}' "$file2")
read2=$(awk -v n="${x1[1]}" 'FNR == n {print $1}' "$file2")
read3=$(awk -v n="${x1[2]}" 'FNR == n {print $1}' "$file2")
Do the same thing for another file, except we don't need $x1 for this.
read4=$(awk 'FNR == 1{print $3,$4,$5,$6}' $file3)
Finally, just output all these values, read1 through read4, to a file.
This needs to happen in a loop over all the files in the folder; I'm not quite sure how to go about that. The tricky part is that the filename of $file3 depends on the filename of $file1,
so if $file1 = abc123def.fna.map.txt
$file3 would be abc123def.fna
$file2 is hardcoded and stays the same for all the iterations.
file1 is a .txt file and a part of it looks like:
99 58900
16 59000
14 73000
file2 contains 600 lines of strings.
'Actinobacillus_pleuropneumoniae_L20'
'Actinobacillus_pleuropneumoniae_serovar_3_JL03'
'Actinobacillus_succinogenes_130Z'
file3 is a FASTA file and the first two lines look like this:
>gi|94986445|ref|NC_008011.1| Lawsonia intracellularis PHE/MN1-00, complete genome
ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
The output can just write all four reads to a file or, if possible, compare read1, read2, and read3 against read4, i.e. check whether the main name matches. In my example:
none of read1-read3 matches Lawsonia intracellularis, which is part of read4, so it can just print success or failure to the file.
SAMPLE OUTPUT
Actinobacillus_pleuropneumoniae_L20
Actinobacillus_pleuropneumoniae_serovar_3_JL03
Actinobacillus_succinogenes_130Z
Lawsonia intracellularis
Failure
Sorry, I was wrong about the six reads; I actually just need four. Thanks for the help again.
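For reference, a rough bash/awk sketch of the per-file loop described above (the index filename and results file are assumptions, and it is untested against the real data):
file2=index.txt                      # assumed name of the hardcoded index file
for file1 in *.fna.map.txt; do
  file3=${file1%.map.txt}            # e.g. abc123def.fna
  mapfile -t x1 < <(awk 'FNR <= 3 {print $2}' "$file1")   # (1,2) (2,2) (3,2)
  for i in 0 1 2; do
    n=$(( x1[i] / 100 + 1 ))         # line number into file2
    reads[i]=$(awk -v n="$n" 'FNR == n' "$file2")
  done
  read4=$(awk 'FNR == 1 {print $3, $4, $5, $6}' "$file3")
  printf '%s\n' "${reads[@]}" "$read4" >> results.txt
done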
This problem can be solved with TXR: http://www.nongnu.org/txr
Okay, I have these sample files (not your inputs, unfortunately):
$ ls -l
total 16
-rwxr-xr-x 1 kaz kaz 1537 2012-03-18 20:07 bac.txr # the program
-rw-r--r-- 1 kaz kaz 153 2012-03-18 19:16 foo.fna # file3: genome info
-rw-r--r-- 1 kaz kaz 24 2012-03-18 19:51 foo.fna.map.txt # file1
-rw-r--r-- 1 kaz kaz 160 2012-03-18 19:56 index.txt # file2: names of bacteria
$ cat index.txt
'Actinobacillus_pleuropneumoniae_L20'
'Actinobacillus_pleuropneumoniae_serovar_3_JL03'
'Lawsonia_intracellularis_PHE/MN1-00'
'Actinobacillus_succinogenes_130Z'
$ cat foo.fna.map.txt # note leading spaces: typo or real?
13 000
19 100
7 200
$ cat foo.fna
gi|94986445|ref|NC_008011.1| Lawsonia intracellularis PHE/MN1-00, complete genome
ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
As you can see, I cooked the data so there will be a match on the Lawsonia.
Run it:
$ ./bac.txr foo.fna.map.txt
Lawsonia intracellularis PHE/MN1-00 ATGAAGATCTTTTTATAGAGATAGTAATAAAAAAATGTCAGATAGATATACATTATAGTATAGTAGAGAA
Code follows. This is just a prototype; obviously it has to be developed and tested using the real data. I've made some guesses, like what the Lawsonia entry would look like in the index with the code attached to it.
#!/usr/local/bin/txr -f
@;;; collect the contents of the index file
@;;; into the list called index.
@;;; single quotes around lines are removed
@(block)
@ (next "index.txt")
@ (collect)
'@index'
@ (end)
@(end)
@;;; filter underscores to spaces in the index
@(set index @(mapcar (op regsub #/_/ " ") index))
@;;; process files on the command line
@(next :args)
@(collect)
@;;; each command line argument has to match two patterns
@;;; @file1 takes the whole thing
@;;; @file3 matches the part before .map.txt
@ (all)
@file1
@ (and)
@file3.map.txt
@ (end)
@;;; go into file 1 and collect second column material
@;;; over three lines into lineno list.
@ (next file1)
@ (collect :times 3)
@junk @lineno
@ (end)
@;;; filter lineno list through a function which
@;;; converts to integer, divides by 100 and adds 1.
@ (set lineno @(mapcar (op + 1 (trunc (int-str @1) 100))
lineno))
@;;; map the three line numbers to names through the
@;;; index, and bind these three names to variables
@ (bind (name1 name2 name3) @(mapcar index lineno))
@;;; now go into file 3, and extract the name of the
@;;; bacterium there, and the genome from the 2nd line
@ (next file3)
@a|@b|@c|@d| @name, complete genome
@genome
@;;; if the name matches one of the three names
@;;; then output the name and genome, otherwise
@;;; output failed
@ (cases)
@ (bind name (name1 name2 name3))
@ (output)
@name @genome
@ (end)
@ (or)
@ (output)
failed
@ (end)
@ (end)
@(end)
