Multiple pipes in C

I want to implement multiple pipes in C so I can do something like this (where ||| means "duplicate the stdin to N piped commands"):
cat /tmp/test.log ||| wc -l ||| grep test1 ||| grep test2 | grep test3
This would return the number of lines in the file, the lines in the file that contain the string 'test1', and the lines in the file that contain both 'test2' and 'test3'.
In other words, it would have the same effect as these 3 regular pipelines:
cat /tmp/test.log | wc -l --> stdout
| grep test1 --> stdout
| grep test2 | grep test3 --> stdout
Has someone already implemented something like this? I didn't find anything...
NOTE: I know it can be done with scripting languages or with bash's multiple file descriptors, but I am looking for C code to do it.
Thanks!

Maybe you should start off with the tee command and examine its code.

Because it is impossible to have more than one process (or thread) read the same file descriptor without draining the data it reads, all solutions will have to store the data somewhere, for example in a temporary file, and let each command read its own copy.
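To make that concrete, here is a minimal, hypothetical sketch of the tee-like idea in C (not a finished tool: the command list is hard-coded and error/short-write handling is omitted). It copies the data into one pipe per command rather than a temporary file; either way the parent has to duplicate everything it reads once per consumer.

/*
 * Sketch: fan stdin out to N commands, each behind its own pipe,
 * the same way tee(1) fans data out to files.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    /* Hypothetical, hard-coded command list; a real tool would build it
       from its own "|||" syntax. */
    char *cmds[][4] = {
        { "wc", "-l" },
        { "grep", "test1" },
        { "grep", "test2" },
    };
    int n = sizeof(cmds) / sizeof(cmds[0]);
    int fds[n];                          /* write end of each child's pipe */
    char buf[4096];
    ssize_t r;

    for (int i = 0; i < n; i++) {
        int p[2];
        if (pipe(p) < 0) { perror("pipe"); exit(1); }
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid == 0) {                  /* child: its pipe becomes stdin */
            for (int j = 0; j < i; j++)
                close(fds[j]);           /* drop write ends of earlier pipes */
            dup2(p[0], STDIN_FILENO);
            close(p[0]);
            close(p[1]);
            execvp(cmds[i][0], cmds[i]);
            perror("execvp");
            _exit(127);
        }
        close(p[0]);                     /* parent keeps only the write end */
        fds[i] = p[1];
    }

    /* Copy stdin once, writing every block to every child
       (short writes and write errors ignored for brevity). */
    while ((r = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
        for (int i = 0; i < n; i++)
            write(fds[i], buf, (size_t)r);

    for (int i = 0; i < n; i++)
        close(fds[i]);                   /* EOF for the children */
    while (wait(NULL) > 0)
        ;                                /* reap all children */
    return 0;
}

Each child only ever reads its own pipe, so a multi-command branch such as grep test2 | grep test3 could be handled by exec'ing sh -c "grep test2 | grep test3" for that slot.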

Vlookup-like function using awk in ksh

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite them.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
3) You will find some text in capital letters here and there. It is of course not me "shouting" at you, but only a way to make portions of text stand out. Please do not consider this an act of impoliteness.
4) For those of you who get to the bottom of this novella alive, THANKS IN ADVANCE for your patience, even if you are not able to (or do not feel like) helping me. My disclaimer here would be the fact that, after surfing the site for a while, I noticed that the most common "complaint" from people willing to help seems to be the lack of information (and/or lack of quality) provided by the ones seeking help. I therefore preferred to be accused of overwording if need be... it would, at least, not be a common offense...
The "Problem":
I have 2 files (a and b for simplification). File a has 7 columns separated by commas. File b has 2 columns separated by commas.
What I need: Whenever the data in the 7th column of file a matches -EXACT MATCHES ONLY- the data on the 1st column of file b, a new line, containing the whole line of file a plus column 2 of file b is to be appended into a new file "c".
--- MORE INFO IN THE NOTES AT THE BOTTOM ---
file a:
Server Name,File System,Path,File,Date,Type,ID
horror,/tmp,foldera/folder/b/folderc,binaryfile.bin,2014-01-21 22:21:59.000000,typet,aaaaaaaa
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333
hostile,/sad,folder22,higefile.hug,2016-06-17 18:43:12.000000,typeasd,77777777
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999
file b:
ID,Size
11111111,215915
22222222,1716
33333333,212856
44444444,1729
55555555,215927
66666666,1728
88888888,1729
99999999,213876
bbbbbbbb,26669080
Expected file c:
Server Name,File System,Path,File,Date,Type,ID,Size
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
Additional notes:
0) Notice how the line with ID "aaaaaaaa" in file a does not make it into file c, since ID "aaaaaaaa" is not present in file b. Likewise, the line with ID "bbbbbbbb" in file b does not make it into file c, since ID "bbbbbbbb" is not present in file a and is therefore never looked for in the first place.
1) The data is clearly completely made up due to confidentiality issues, though the examples provided fairly resemble what the real files look like.
2) I added headers just to provide a better idea of the nature of the data. The real files don't have them, so there is no need to skip them in the source files nor to create them in the destination file.
3) Both files come sorted by default, meaning that IDs will be properly sorted in file b, while they will most likely be scrambled in file a. File c should preferably follow the order of file a (though I can manipulate it later to fit my needs anyway, so no worries there, as long as the code does what I need and doesn't mess up the data by combining the wrong lines).
4) VERY VERY VERY IMPORTANT:
4.a) I already have a "working" ksh script (attached below) that uses cat, grep, while and if to do the job. It worked like a charm (well, acceptably) with 160K-line sample files: it was able to output roughly 60K lines an hour, which, projected, would yield an acceptable "20 days" to produce 30 million lines [KEEP ON READING]. But somehow (and I have plenty of processor and memory capacity) cat and/or grep seem to be struggling to process a real-life 5-million-line file (both file a and file b can have up to 30 million lines each, so that's the maximum probable number of lines in the resulting file, even assuming 100% of the lines in file a find their match in file b), and file c is now only being fed a couple hundred lines every 24 hours.
4.b) I was told that awk, being stronger, should succeed where the weaker commands I have been working with seem to fail. I was also told that working with arrays might be the solution to my performance problem, since all the data is loaded into memory at once and worked on from there, instead of having to cat | grep file b as many times as there are lines in file a, as I am currently doing.
4.c) I am working on AIX, so I only have sh and ksh, no bash, and therefore I cannot use the array tools provided by the latter. That's why I thought of awk; that, and the fact that I think awk is probably "stronger", though I might be (probably?) wrong.
Now, I present to you the magnificent piece of ksh code (obvious sarcasm here, though I like the idea of you picturing for a brief moment in your mind the image of the monkey holding up and showing all other jungle-crawlers their future lion king) I have managed to develop (feel free to laugh as hard as you need while reading this code, I will not be able to hear you anyway, so no feelings harmed :P ):
cat "${file_a}" | while read -r line_file_a; do
server_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $1}'`
filespace_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $2}'`
folder_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $3}'`
file_name_file_a=`echo "${line_file_a}" | awk -F"," '{print $4}'`
file_date_file_a=`echo "${line_file_a}" | awk -F"," '{print $5}'`
file_type_file_a=`echo "${line_file_a}" | awk -F"," '{print $6}'`
file_id_file_a=`echo "${line_file_a}" | awk -F"," '{print $7}'`
cat "${file_b}" | grep ${object_id_file_a} | while read -r line_file_b; do
file_id_file_b=`echo "${line_file_b}" | awk -F"," '{print $1}'`
file_size_file_b=`echo "${line_file_b}" | awk -F"," '{print $2}'`
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}" >> ${file_c}.csv
fi
done
done
One last additional note, just in case you wonder:
The "if" section was not only built as a mean to articulate the output line, but it servers a double purpose, while safe-proofing any false positives that may derive from grep, IE 100 matching 1000 (Bear in mind that, as I mentioned earlier, I am working on AIX, so my grep does not have the -m switch the GNU one has, and I need matches to be exact/absolute).
You have reached the end. CONGRATULATIONS! You've been awarded the medal to patience.
$ cat stuff.awk
BEGIN { FS=OFS="," }
NR == FNR { a[$1] = $2; next }
$7 in a { print $0, a[$7] }
Note the order in which the files are provided to the awk command: b first, followed by a:
$ awk -f stuff.awk b.txt a.txt
host1,/,somefolder,test1.txt,2016-08-18 00:00:20.000000,typez,11111111,215915
host20,/,somefolder/somesubfolder,usr.cfg,2015-12-288 05:00:20.000000,typen,22222222,1716
hoster,/lol,foolie,anotherfile.sad,2014-01-21 22:21:59.000000,typelol,66666666,1728
hostie,/,someotherfolder,somefile.txt,2016-06-17 18:43:12.000000,typea,33333333,212856
hostin,/var,folder30,someotherfile.cfg,2014-01-21 22:21:59.000000,typo,44444444,1729
hostn,/usr,foldie,tinyfile.lol,2016-08-18 00:00:20.000000,typewhatever,55555555,215927
server10,/usr,foldern,tempfile.tmp,2016-06-17 18:43:12.000000,tipesad,99999999,213876
EDIT: Updated calculation
You can try to estimate how many external programs you are starting:
For each line of file a: at least 7 awk's + 1 cat + 1 grep, which is roughly 9 * 160,000 process starts.
For each hit in file b: 2 awk's plus one open and one close of the output file, which with 60K output lines is roughly 4 * 60,000.
A small change in the code reduces this to "only" 160,000 greps:
cat "${file_a}" | while IFS=, read -r server_name_file_a \
filespace_name_file_a folder_name_file_a file_name_file_a \
file_date_file_a file_type_file_a file_id_file_a; do
grep "${object_id_file_a}" "${file_b}" | while IFS="," read -r line_file_b; do
if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
fi
done
done >> ${file_c}.csv
Well, try this with your 160K files and see how much faster it is.
Before I explain why this is still the wrong way, I will make another small improvement: I will replace the cat that feeds the while loop with an input redirection at the end (after the done).
while IFS=, read -r server_name_file_a \
        filespace_name_file_a folder_name_file_a file_name_file_a \
        file_date_file_a file_type_file_a file_id_file_a; do
    grep "${file_id_file_a}" "${file_b}" | while IFS=, read -r file_id_file_b file_size_file_b; do
        if [ "${file_id_file_a}" = "${file_id_file_b}" ]; then
            echo "${server_name_file_a},${filespace_name_file_a},${folder_name_file_a},${file_name_file_a},${file_date_file_a},${file_type_file_a},${file_id_file_a},${file_size_file_b}"
        fi
    done
done < "${file_a}" >> ${file_c}.csv
The main drawback of these solutions is that grep reads the complete file_b again and again, once for each line in file a.
This is a nice improvement in performance, but there is still a lot of overhead from grep. Another huge improvement can be found with awk.
The best solution is using awk as explained in What is "NR==FNR" in awk? and found in the answer of @jas.
Only one external program is started, and both files are read only once.

Creating variables from command line in BASH

I'm a very new user to bash, so bear with me.
I'm trying to run a bash script that will take inputs from the command line and then run a C program which pipes its output to other C programs.
For example, at command line I would enter as follows:
$ ./script.sh -flag file1 < file2
Then within the script I would have:
./c_program -flag file1 < file2 | other_c_program
The problem is, -flag, file1 and file2 need to be variable.
Now I know that for -flag and file1 this is fairly simple; I can do
FLAG=$1
FILE1=$2
./c_program $FLAG $FILE1
My problem is: is there a way to assign a variable within the script to file2?
EDIT:
It's a requirement of the program that the script is called as
$ ./script.sh -flag file1 < file2
You can simply run ./c_program exactly as you have it, and it will inherit stdin from the parent script. It will read from wherever its parent process is reading from.
FLAG=$1
FILE1=$2
./c_program "$FLAG" "$FILE1" | other_c_program # `< file2' is implicit
Also, it's a good idea to quote variable expansions. That way if $FILE1 contains whitespace or other tricky characters the script will still work.
There is no simple way to do what you are asking. This is because when you run this:
$ ./script.sh -flag file1 < file2
The shell that interprets the command will open file2 for reading and feed its contents to script.sh's standard input. Your script will never know what the file's name was, and therefore cannot store that name in a variable. However, you could invoke your script this way:
$ ./script.sh -flag file1 file2
Then it is quite straightforward: you already know how to get file1, and file2 is handled the same way.
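For example, a minimal hypothetical sketch of that variant (the variable names and the pipeline are simply carried over from the question) could be:
#!/bin/sh
# Hypothetical usage: ./script.sh -flag file1 file2
FLAG=$1
FILE1=$2
FILE2=$3

# Redirect file2 into the C program ourselves instead of relying on the
# caller's "< file2"; its name is now also available in $FILE2.
./c_program "$FLAG" "$FILE1" < "$FILE2" | other_c_program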

How to find duplicate lines across 2 different files? Unix

From the unix terminal, we can use diff file1 file2 to find the difference between two files. Is there a similar command to show the similarities between two files? (Many pipes allowed if necessary.)
Each file contains lines with string sentences; they are sorted and duplicate lines have been removed with sort file1 | uniq.
file1: http://pastebin.com/taRcegVn
file2: http://pastebin.com/2fXeMrHQ
And the output should contain the lines that appear in both files.
output: http://pastebin.com/FnjXFshs
I am able to do it in Python as follows, but I think it's a little too much to type into the terminal:
x = set([i.strip() for i in open('wn-rb.dic')])
y = set([i.strip() for i in open('wn-s.dic')])
z = x.intersection(y)
outfile = open('reverse-diff.out', 'w')
for i in z:
    print >>outfile, i
If you want to get a list of repeated lines without resorting to AWK, you can use the -d flag of uniq:
sort file1 file2 | uniq -d
As @tjameson mentioned it may be solved in another thread.
Just would like to post another solution:
sort file1 file2 | awk 'dup[$0]++ == 1'
Refer to an awk guide to get some awk basics: when the pattern of a line evaluates to true, that line is printed.
dup[$0] is a hash table in which each key is a line of the input. The value starts at 0 and is incremented each time that line occurs; when it occurs again the value is 1, so dup[$0]++ == 1 is true and the line is printed.
Note that this only works when there are no duplicates in either file, as was specified in the question.
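If duplicates inside a single file are possible after all, a hypothetical variant of the same hash-table idea (not from the answers above, just a sketch) still prints each common line exactly once:
# Remember every line of file1; print a file2 line the first time it also
# appears in file1, then forget it so repeats are not printed again.
awk 'NR==FNR { seen[$0]; next } $0 in seen { print; delete seen[$0] }' file1 file2
Unlike the sort | uniq -d approach, this does not require the inputs to be sorted.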

Faster grep function for big (27GB) files

I have a file (5MB) containing specific strings, and I have to grep those same strings (and other information) out of a big file (27GB).
To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (with the help of some people here). However, it is not very efficient (producing a 180KB file takes 30 hours!).
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}`; do
    for x in `echo {a..z}`; do
        for y in `echo {a..z}`; do
            for ids in $(cat input.sam | awk '{print $1}'); do
                grep $ids sample_"$z""$x""$y" | awk '{print $1" "$10" "$11}' >> output.txt &
                let count+=1
                [[ $((count%NR_CPUS)) -eq 0 ]] && wait
            done
        done #&
    done
done
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}; do
    for x in {a..z}; do
        for y in {a..z}; do
            LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
        done
    done
done >> output.txt
Also, check out GNU Parallel which is designed to help you run jobs in parallel.
My initial thoughts are that you're repeatedly spawning grep. Spawning processes is very expensive (relatively) and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require the continual process creation.
e.g. for each inner loop you're kicking off cat and awk (you won't need cat since awk can read files, and in fact doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for 4 greps to finish and you go around again.
If you have to use grep, you can use
grep -f filename
to specify the set of patterns to match in the file filename, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the --mmap option shows a clear improvement on a 2 GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source file changes during the search.
from man grep
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
Using GNU Parallel it would look like this:
awk '{print $1}' input.sam > idsFile.txt
doit() {
LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

Bash executing a subset of lines in script

I have a file which contains commands similar to:
cat /home/ptay89/test/01.out
cat /home/ptay89/testing/02.out
...
But I only want a few of them executed. For example, if I only want to see the output of the files ending in 1.out, I can do this:
cat commands | grep 1.out | sh
However, I get the following output for each of the lines in the commands file:
: cannot be loaded - no such file or directoryst/01.out
When I copy and paste the commands I want from the file directly, it works fine. Are there better ways of doing this?
You probably have spurious carriage returns in your file (created under Windows?). Use tr instead of cat to remove them:
tr -d '\015' <commands | grep 1.out | sh
Try doing a
grep -e '^cat.*out' commands | grep 1.out | sh
That should ignore any weird characters and take only the ones you need.
