Disclaimers:
1) English is my second language, so please forgive any gramatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
3) You will find some text in capital letters here and there. Is is of course not me "shouting" at you, but only a way to make portions of text stand out. Plase do not consider this an act of unpoliteness.
4) For those of you who get to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you do not get to be able to/feel like help/ing me. My disclamer here would be the fact that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be lack of information (and/or the lack of quality) provided by the ones seeking for help. I then preferred to be accused of overwording if need be... It would be, at least, not a common offense...
The "Problem":
I have 2 files (a and b for simplification). File a has 7 columns separated by commas. File b has 2 columns separated by commas.
What I need: Whenever the data in the 7th column of file a matches -EXACT MATCHES ONLY- the data on the 1st column of file b, a new line, containing the whole line of file a plus column 2 of file b is to be appended into a new file "c".
--- MORE INFO IN THE NOTES AT THE BOTTOM ---
file a:
Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999
file b:
ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080
Expected file c:
Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
Additional notes:
0) Notice how line with ID "aaaaaaaa" in file a does not make it into file c since ID "aaaaaaaa" is not present in file b. Likewise, line with ID "bbbbbbbb" in file b does not make it into file c since ID "bbbbbbbb" is not present in file a and it is therefore never looked out for in the first place.
1) Data is clearly completely made out due to confidenciality issues, though the examples provided fairly resemble what the real files look like.
2) I added headers just to provide a better idea of the nature of the data. The real files don't have it, so no need to skip them on the source file nor create it in the destination file.
3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will be most likely scrambled in file a. File c should preferably follow the order of file a (though I can manipulate later to fit my needs anyway, so no worries there, as long as the code does what I need and doesn't mess up with the data by combining the wrong lines).
4) VERY VERY VERY IMPORTANT:
4.a) I already have a "working" ksh code (attached below) that uses "cat", "grep", "while" and "if" to do the job. It worked like a charm (well, acceptably) with 160K-lines sample files (it was able to output 60K lines -approx- an hour, which, in projection, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]), but somehow (I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real life 5Million-lines file (both file a and b can have up to 30 million lines each, so that's the maximum probable amount of lines in the resulting file, even assuming 100% lines in file a find it's match in file b) and the c file is now only being feed with a couple hundred lines every 24 hours.
4.b) I was told that awk, being stronger, should succeed where the more weaker commands I worked with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all data is uploded to memory at once and worked from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.
4.c) I am working on AIX, so I only have sh and ksh, no bash, therefore I cannot use the array tools provided by the latter, that's why I thought of AWK, that and the fact that I think AWK is probably "stronger", though I might be (probably?) wrong.
Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing for a brief moment in your mind the image of the monkey holding up and showing all other jungle-crawlers their future lion king) I have managed to develop (feel free to laugh as hard as you need while reading this code, I will not be able to hear you anyway, so no feelings harmed :P ):
cat "${file_a}" | while read -r line_file_a; do
server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`
cat "${file_b}" | grep ${object_id_file_a} | while read -r line_file_b; do
file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv
fi
done
done
One last additional note, just in case you wonder:
The "if" section was not only built as a mean to articulate the output line, but it servers a double purpose, while safe-proofing any false positives that may derive from grep, IE 100 matching 1000 (Bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute).
You have reached the end. CONGRATULATIONS! You've been awarded the medal to patience.
$ cat stuff.awk
BEGIN { FS=OFS="," }
NR == FNR { a[$1] = $2; next }
$7 in a { print $0, a[$7] }
Note the order for providing the files to the awk command, b first, followed by a:
$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
EDIT: Updated calculation
You can try to predict how often you are calling another program:
At least 7 awk's + 1 cat + 1 grep for each line in file a multiplied by 2 awk's for each line in file b.
(9 * 160.000).
For file b: 2 awk's, one file open and one file close for each hit. With 60K output, that would be 4 * 60.000.
A small change in the code can change this into "only" 160.000 times a grep:
cat "${file_a}" | while IFS=, read -r server_name_file_a \
filespace_name_file_a folder_name_file_a file_name_file_a \
file_date_file_a file_type_file_a file_id_file_a; do
grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
fi
done
done >> ${file_c}.csv
Well, try this with your 160K files and see how much faster it is.
Before I explain that this still is the wrong way I will make another small improvement: I will move the cat for the while loop to the end (after done).
while IFS=, read -r server_name_file_a \
filespace_name_file_a folder_name_file_a file_name_file_a \
file_date_file_a file_type_file_a file_id_file_a; do
grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
fi
done
done < "${file_a}" >> ${file_c}.csv
The main drawback of the solutions is that you are reading the complete file_b again and again with your grep for each line in file a.
This solution is a nice improvement in the performance, but still a lot overhead with grep. Another huge improvement can be found with awk.
The best solution is using awk as explained in What is "NR==FNR" in awk? and found in the answer of #jas.
It is only one system call and both files are only read once.
I don't know if this is even possible in command line, but anyway, here is what I want to do:
I have a text file written like that
- FileName1.txt
http://example.com/AnyName-For-File-1.txt
- FileName2.txt
- FileName3.txt
- FileName4.txt
http://example.com/AnyName-For-File-4.txt
- FileName5.txt
http://example.com/AnyName-For-File-5.txt
As you can see, the text was written randomly (somehow), which means that some files have an address, and some of them don't, so I can't Apply any rule on these lines like arranging\sorting and so ever, or I'm gonna lose the files "Names,Addresses" arrangement.
So, first I had to Move All of the addresses lines, one step up (that was the easy part in GUI), and I was able to do it using Np++/TextPad Regex as follow:- Find:\nhttp - Replace:http , The final result was like this:
Step.1:-
- FileName1.txt http://example.com/AnyName-For-File-1.txt
- FileName2.txt
- FileName3.txt
- FileName4.txt http://example.com/AnyName-For-File-4.txt
- FileName5.txt http://example.com/AnyName-For-File-5.txt
Now, The worst part (at least for me) is to move the matching pattern to the beginning of their lines, Exactly Like This:
Step.2:-
http://example.com/AnyName-For-File-1.txt- FileName1.txt
- FileName2.txt
- FileName3.txt
http://example.com/AnyName-For-File-4.txt- FileName4.txt
http://example.com/AnyName-For-File-5.txt- FileName5.txt
and now I can easily sort them, or whatever I need without any risk.
So, my question is:-
In Command Line CMD or Cygwin :-
1- How to Find "\nhttp" , and Replace with " http" ?
2- How to Move The Matching Patterns (File Address, From http to .txt), to the beginning of their Lines ?
also if there is any other technique, it would be great to know it.
Thanks a lot guys for the help you're offering, in such a great community. I really appreciate your help :)
Here is an awk command which, I think, does what you want:
$ awk '/^http/{print $0 last;last="";next} last {print last} {last=$0} END{if (last) print last;}' file2
http://example.com/AnyName-For-File-1.txt- FileName1.txt
- FileName2.txt
- FileName3.txt
http://example.com/AnyName-For-File-4.txt- FileName4.txt
http://example.com/AnyName-For-File-5.txt- FileName5.txt
How it works
The script has one variable, last which contains the contents of the previous line. awk implicitly loops over every line in the input file
/^http/{print $0 last;last="";next}
If the current line starts with http, then print it and the previous line together. Set last to empty and skip the remaining commands and jump to the next line.
last {print last}
If the last variable is not empty, print it. This only happens if there was no URL to go with the last line.
{last=$0}
Update the last variable with the current line. In awk, $0 denotes the whole of the current line.
END{if (last) print last;}
At the end of the input, if there is still a line in last, print it. This only happens if the last line was a file name which lacked a URL.
Doing just the first step in sed
As long as file is not too big, this will work:
$ sed ':a;N;$!b a;s/\nhttp/ http/g' file
- FileName1.txt http://example.com/AnyName-For-File-1.txt
- FileName2.txt
- FileName3.txt
- FileName4.txt http://example.com/AnyName-For-File-4.txt
- FileName5.txt http://example.com/AnyName-For-File-5.txt
This works by reading the entire file into sed's pattern space and then substituting to replace \nhttp with http.
In more detail:
:a;N;$!b a
This is a loop. :a is a label. N reads the next line into the pattern space. b a jumps to label :a. We want to continue this loop until the end of the file. The last line in the file is called $ and ! means not. So, $!b a means jump to label :a unless we have reached the last line of the file.
s/\nhttp/ http/g
Now that we have the whole of the file in the pattern space, we do a global substitution replacing \nhttp with http.
This is a variation on the above. It reads lines into the pattern space until it reaches a line that starts with http. Then, it removes the newline from in front of that line:
$ sed ':a;N;/http/!b a; s/\nhttp/ http/' file
- FileName1.txt http://example.com/AnyName-For-File-1.txt
- FileName2.txt
- FileName3.txt
- FileName4.txt http://example.com/AnyName-For-File-4.txt
- FileName5.txt http://example.com/AnyName-For-File-5.txt
Since this approach doesn't read the whole file in at once, it is easier on memory if the file is large.
In more detail:
:a;N;/http/!b a
Just as above, this is a loop. It keeps branching back to label :a to read another line until we get a line that includes http.
s/\nhttp/ http/
This replaces the newline in front of http with a space.
This might work for you (GNU sed):
sed -r 'N;s/(^-.*)\n(http.*)/\2\1/;P;D' file
Read two lines at a time and swap line 2 for line 1 (removing the newline) if the pattern matches. Those lines that do not match are printed as is.
This short Perl program will do as you ask.
Be careful to backup your original file, as it modifies the file in-place.
The path to the file to be edited is passed as a parameter on the command line, like this
perl edit_file.pl mytext.txt
use strict;
use warnings;
use Tie::File;
tie my #file, 'Tie::File', shift or die $!;
for ( my $i = 1; $i < #file; ) {
if ( $file[$i] =~ m<^http://>i ) {
$file[$i] .= $file[$i-1];
splice #file, $i-1, 1;
next;
}
++$i;
}
result
http://example.com/AnyName-For-File-1.txt- FileName1.txt
- FileName2.txt
- FileName3.txt
http://example.com/AnyName-For-File-4.txt- FileName4.txt
http://example.com/AnyName-For-File-5.txt- FileName5.txt
Noble StackOverflow readers,
I have a comma seperated file, each line of which I am putting into an array.
Data looks as so...
25455410,GROU,AJAXa,GROU1435804437
25455410,AING,EXS3d,AING4746464646
25455413,TRAD,DLGl,TRAD7176202067
There are 103 lines and I am able to generate the 103 arrays without issue.
n=1; while read -r OrdLine; do
IFS=',' read -a OrdLineArr${n} <<< "$OrdLine"
let n++
done < $WkOrdsFile
HOWEVER, I can only access the arrays as so...
echo "${OrdLineArr3[0]} <---Gives 25455413
I cannot access it with the number 1-103 as a variable - for example the following doesn't work...
i=3
echo "${OrdLineArr${i}[0]}
That results in...
./script2.sh: line 24: ${OrdLineArr${i}[0]}: bad substitution
I think that the answer might involve 'eval' but I cannot seem to find a fitting example to borrow. If somebody can fix this then the above code makes for a very easy to handle 2d array replacement in bash!
Thanks so much for you help in advance!
Dan
You can use indirect expansion. For example, if $key is OrdLineArr4[7], then ${!key} (with an exclamation point) means ${OrdLineArr4[7]}. (See §3.5.3 "Shell Parameter Expansion" in the Bash Reference Manual, though admittedly that passage doesn't really explain how indirect expansion interacts with arrays.)
I'd recommend wrapping this in a function:
function OrdLineArr () {
local -i i="$1" # line number (1-103)
local -i j="$2" # field number (0-3)
local key="OrdLineArr$i[$j]"
echo "${!key}"
}
Then you can write:
echo "$(OrdLineArr 3 0)" # prints 25455413
i=3
echo "$(OrdLineArr $i 0)" # prints 25455413
This obviously isn't a total replacement for two-dimensional arrays, but it will accomplish what you need. Without using eval.
eval is usually a bad idea, but you can do it with:
eval echo "\${OrdLineArr$i[0]}"
I would store each line in an array, but split it on demand:
readarray OrdLineArr < $WkOrdsFile
...
OrdLine=${OrdLineArr[i]}
IFS=, read -a Ord <<< "$OrdLine"
However, bash isn't really equipped for data processing; it's designed to facilitate process and file management. You should consider using a different language.