Faster grep function for big (27 GB) files

I have a file (5 MB) containing specific strings, and I need to grep those same strings (and other information) out of a big file (27 GB).
To speed up the analysis I split the 27 GB file into 1 GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180 KB output file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}`; do
  for x in `echo {a..z}`; do
    for y in `echo {a..z}`; do
      for ids in $(cat input.sam | awk '{print $1}'); do
        grep $ids sample_"$z""$x""$y" | awk '{print $1" "$10" "$11}' >> output.txt &
        let count+=1
        [[ $((count%NR_CPUS)) -eq 0 ]] && wait
      done
    done
  done
done

A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}; do
  for x in {a..z}; do
    for y in {a..z}; do
      LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
    done
  done
done >> output.txt
Also, check out GNU Parallel, which is designed to help you run jobs in parallel.

My initial thought is that you're repeatedly spawning grep. Spawning processes is very expensive (relatively speaking), and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require continual process creation.
For example, in each inner loop you're kicking off cat and awk (you won't need cat, since awk can read files itself; in fact, doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for 4 greps to finish and go around again.
If you have to use grep, you can use
grep -f filename
to specify the set of patterns to match in the file, rather than passing a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
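For instance, a minimal sketch under the question's assumptions (IDs in the first column of input.sam, chunk files named sample_aaa through sample_zzz; ids.txt and matches.txt are hypothetical names):
# Pre-generate the pattern list once, then run one grep per chunk
# instead of one grep per ID.
awk '{print $1}' input.sam > ids.txt
for chunk in sample_*; do
  grep -F -f ids.txt "$chunk"
done > matches.txt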

OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the --mmap option shows a modest improvement on a ~2 GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source file changes during the search.
From man grep:
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
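Combined with a pattern file and a single invocation per chunk, that might look like this (a sketch; note that newer GNU grep releases accept --mmap but treat it as a no-op):
# One mmap-backed grep over one chunk: C locale, fixed-string patterns.
LC_ALL=C grep --mmap -F -f idsFile.txt sample_aaa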

Using GNU Parallel it would look like this:
awk '{print $1}' input.sam > idsFile.txt
doit() {
  LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

Related

using "ls" and preserving the spaces in the resulting array

I am trying to read a directory with "ls" and do operations on it
directory example:
$ ls -1
x x
y y
z z
script file: myScript.sh
#!/bin/bash
files=(`ls -1`);
for ((i=0; i<"${#files[@]}"; i+=1 )); do
echo "${files[$i]}"
done
however, the output is
$ myScript.sh
x
x
y
y
z
z
yet if I define "files" in the following way
$ files=("x x" "y y" "z z")
$ for ((i=0; i<"${#files[@]}"; i+=1 )); do echo "${files[$i]}"; done
x x
y y
z z
How can I preserve the spaces in "files=(`ls -1`)"?
Don't.
See:
ParsingLs
BashPitfalls #1
If at all possible, use a shell glob instead.
That is to say:
files=( * )
If you need to represent filenames as a stream of text, use NUL delimiters.
That is to say, either:
printf '%s\0' *
or
find . -mindepth 1 -maxdepth 1 -print0
will emit a NUL-delimited string, which you can load into a shell array safely using (in modern bash 4.x):
readarray -d '' array < <(find . -mindepth 1 -maxdepth 1 -print0)
...or, to support bash 3.x:
array=( )
while IFS= read -r -d '' name; do
array+=( "$name" )
done < <(find . -mindepth 1 -maxdepth 1 -print0)
In either of the above, the find command can potentially be on the other side of a FIFO, network stream, or other remoting layer (assuming that there's some complexity of that sort stopping you from using a native shell glob).
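Once the array is loaded (by either method), iterate with a quoted expansion so each filename stays a single word, spaces included:
# "${array[@]}" expands to one word per element, preserving spaces.
for name in "${array[@]}"; do
  printf '%s\n' "$name"
done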
It seems the main conclusion is not to use ls. Back in the Pleistocene age of Unix programming, people used ls; these days, however, ls is best restricted to producing human-readable displays. A script robust against anything that can be thrown at it (newlines, whitespace, Chinese characters mixed with Hebrew and French, or whatever) is best achieved by some form of globbing, as recommended by others here (see BashPitfalls).
#!/bin/bash
for file in ./*; do
[ -e "${file}" ] || continue
# do some task, for example, test if it is a directory.
if [ -d "${file}" ]; then
echo "${file}"
fi
done
The ./ is maybe not absolutely necessary, but it helps if a filename begins with "-", it clarifies which file contains the newline (or newlines), and it guards against some other nasty buggers. It is also a useful template for matching specific files (e.g., ./*.pdf). For example, suppose the following files are somehow in your directory: "-t" and "<CR>t". Then (revealing other issues with ls when using nonstandard characters):
$ ls
-t ?t
$ for file in *; do ls "${file}"; done
-t ?t
?t
whereas:
$ for file in ./*; do ls "${file}"; done
./-t
./?t
also
$ for file in ./*; do echo "${file}"; done
./-t
./
t
A workaround with POSIX commands is to use --:
$ for file in *; do ls -- "${file}"; done # work around
-t
?t
Try this:
eval files=($(ls -Q))
Option -Q enables quoting of filenames.
Option -1 is implied (not needed) if the output is not a tty.

Shell Script regex matches to array and process each array element

While I've handled this task easily in other languages, I'm at a loss for which commands to use when shell scripting (CentOS/bash).
I have some regex that produces many matches in a file I've read into a variable, and I would like to capture those matches into an array to loop over and process each entry.
For regex, I typically use https://regexr.com/ to form my capture groups and throw that at JS/Python/Go to get an array and loop; but in shell scripting I'm not sure what to use.
So far I've played with sed to find all matches and replace, but I don't know if it's capable of returning an array of matches to loop over.
In short: take a regex, run it on a file, get an array back. I would love some help with shell scripting for this task.
EDIT:
Based on comments, put this together (not working via shellcheck.net):
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=($(sed 'asset\((.*)\)' $examplefile))
for el in ${!examplearr[*]}
do
echo "${examplearr[$el]}"
done
This works in bash on a Mac:
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=(`echo "$examplefile" | sed -e '/.*/s/asset(\(.*\))/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
output:
'1a/1b/1c.ext'
'2a/2b/2c.ext'
'3a/3b/3c.ext'
Note the wrapping of $examplefile in quotes, and the use of sed to replace the entire line with the match. If there will be other content in the file, either on the same lines as the "asset" string or on other lines with no assets at all, you can refine it like this:
#!/bin/sh
examplefile="
fooasset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')bar
foobar
fooasset('3a/3b/3c.ext')bar
"
examplearr=(`echo "$examplefile" | grep asset | sed -e '/.*/s/^.*asset(\(.*\)).*$/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
and achieve the same result.
There are several ways to do this. I'd do with GNU grep with perl-compatible regex (ah, delightful line noise):
mapfile -t examplearr < <(grep -oP '(?<=[(]).*?(?=[)])' <<<"$examplefile")
for i in "${!examplearr[@]}"; do printf "%d\t%s\n" $i "${examplearr[i]}"; done
0 '1a/1b/1c.ext'
1 '2a/2b/2c.ext'
2 '3a/3b/3c.ext'
This uses the bash mapfile command to read lines from stdin and assign them to an array.
The bits you're missing from the sed command:
$examplefile is text, not a filename, so you have to send it to sed's stdin
sed's a funny little language with 1-character commands: you've given it the "a" command, which is inappropriate in this case.
you only want to output the captured parts of the matches, not every line, so you need the -n option, and you need to print somewhere: the p flag in s///p means "print the [line] if a substitution was made".
sed -n 's/asset\(([^)]*)\)/\1/p' <<<"$examplefile"
# or
echo "$examplefile" | sed -n 's/asset\(([^)]*)\)/\1/p'
Note that this returns values like ('1a/1b/1c.ext') -- with the parentheses. If you don't want them, add the -r or -E option to sed: among other things, that flips the meaning of ( and \(
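Concretely, that flipped version would be (same input as above):
# With -E, \( is a literal paren and ( opens the capture group,
# so the surrounding parentheses are dropped from the output.
sed -nE 's/asset\(([^)]*)\)/\1/p' <<<"$examplefile"
which prints '1a/1b/1c.ext' and so on, without the parentheses.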

Proper way to keep array from pipe BASH

I saw quite a few different solutions for keeping an array filled from a pipe, but none seemed to do the trick for me. Currently my script runs correctly, but the array "databasesarray" is lost upon "done". How would I go about keeping this information with my complex pipe scheme?
databasesarray=()
N=0
dbs -d 123123 | grep db | awk '{print $2}' | while read db; do
  databasesarray[$N]="$db"
  databasesarray[$N]+=$(gdb $db | grep dn)
  echo ${N} ${databasesarray[$N]}
  N=$(($N + 1))
done
A better and more efficient way of filling the array in a loop:
databasesarray=()
while read -r db; do
databasesarray+=( "$db $(gdb "$db"|grep "dn")" )
done < <(dbs -d 123123 | awk '/db/{print $2}')
Your grep and awk can be combined into one.
Instead of piping into while, it is better to use the process substitution < <(...) syntax.
PS: You could use read -a to fill the array:
read -a databasesarray < <(dbs -d 123123 | awk '/db/{print $2}')
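Keep in mind that read -a consumes only a single line of input. If dbs prints one name per line, a mapfile sketch (bash 4+, assuming the same hypothetical dbs command as above) collects them all:
# One array element per line of output.
mapfile -t databasesarray < <(dbs -d 123123 | awk '/db/{print $2}')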

Multiple grep keywords on same line?

I'm using the command grep 3 times on the same line like this
ls -1F ./ | grep / | grep -v 0_*.* | grep -v undesired_result
is there a way to combine them into one command instead of having it to pipe it 3 times?
There's no way to do both a positive search (grep <something>) and a negative search (grep -v <something>) in one command line, but if your grep supports -E (alternatively, egrep), you could do ls -1F ./ | grep / | grep -E -v '0_*.*|undesired_result' to reduce the sub-process count by one. To go beyond that, you'd have to come up with a specific regular expression that matches either exactly what you want or everything you don't want.
Actually, I guess that first sentence isn't entirely true if you have egrep, but building the proper regular expression that correctly includes both the positive and negative parts and covers all possible orderings of the parts might be more frustrating than it's worth...
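As a sketch of going beyond that, awk can express the one positive and two negative conditions (the same patterns as the question) in a single process:
# Keep lines containing "/" (directories, given ls -F),
# drop lines matching either unwanted pattern.
ls -1F ./ | awk '/\// && !/0_*.*/ && !/undesired_result/'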

Counting words and delete strings from a text file in unix

I have a question for you: I have a big log file and I want to clean it up. I'm interested only in lines that contain a particular word, and I want to delete the other lines. For example:
access ok from place1
access ko from place1
access ok from place2
access ko from place2
access ok from place3
access ko from place3
......
And I want to obtain only the 'place2' entry:
access ok from place2
access ko from place2
How can I do it?
Thanks in advance!
grep "place2" /path/to/log/file > cleanedFile.txt
I wrote a blog post about combining find/sed/grep - you might be interested.
Try this grep command:
grep "\<place2\>" log-file > out-file
\< and \> make sure to match the full word, so inplace2 will NOT be matched.
grep "\<place2\>" file.log > file.out
wc file.out
wc (word count) is for counting the words. But for two questions, you should normally open two questions. :)
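If all you need is the number of matching lines, grep can count them directly:
grep -c "\<place2\>" file.log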
Another take: select lines where the 4th column equals "place2":
awk '$4 == "place2"' file
Unlike most other answers, this modifies the file in-place and does not need further renaming.
sed -i -n '/place2/p' /var/log/file
This assumes GNU sed. If you don't have GNU sed but have perl:
perl -i -ne '/place2/ && print' /var/log/file
These two examples do in-place editing as well:
$ awk '$NF=="place2"{print $0>FILENAME}' file
$ ruby -i.bak -ane 'print if $F[-1]=="place2"' file
There are other ways to filter these lines:
sed -i.bak -n '/place2$/p' file
grep 'place2$' file > temp && mv temp file
Purely using the shell:
while read -r line; do case $line in *place2) echo "$line";; esac; done < file > temp && mv temp file
