Multiple grep keywords on same line? - batch-file

I'm using the grep command 3 times on the same line, like this:
ls -1F ./ | grep / | grep -v 0_*.* | grep -v undesired_result
Is there a way to combine them into one command instead of having to pipe through grep 3 times?

There's no way to do both a positive search (grep <something>) and a negative search (grep -v <something>) in one command line, but if your grep supports -E (alternatively, egrep), you could do ls -1F ./ | grep / | grep -E -v '0_*.*|undesired_result' to reduce the sub-process count by one. To go beyond that, you'd have to come up with a specific regular expression that matches either exactly what you want or everything you don't want.
Actually, I guess that first sentence isn't entirely true if you have egrep, but building the proper regular expression that correctly includes both the positive and negative parts and covers all possible orderings of the parts might be more frustrating than it's worth...
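If you're open to a tool other than grep, a single awk process can apply both the positive and the negative filters at once. This is just a sketch of that idea (it assumes the 0_*.* pattern was meant to exclude names starting with 0_):
ls -1F ./ | awk '/\// && !/^0_/ && !/undesired_result/'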

Related

Counting the number of files in a directory that contain the different variables in my array - bash script

I have a bash script, which needs to check certain files for certain variables, and count how many files come back containing those variables.
As there is more than one variable I need to look for, I decided to use an array for the variables.
The code I am using is below:
#!/bin/bash
declare -a MYARRAY=('Variable One' 'Variable Two' 'Variable Three');
COUNT_MYARRAY=$(find $DIRECTORY -mtime -1 -exec grep -ln $MYARRAY {} \; | wc -l)
I have declared the $DIRECTORY in my real script.
However, it does not seem to pick up files that contain the second or third variable.
Can anyone see where I might be going wrong?
You can use grep's regex support and pass multiple expressions using 'var1\|var2'. First construct the grep argument, then execute grep.
You don't need line numbers (-n) in grep to count the files...
grep can handle multiple files - it will be faster to pass multiple files to one grep with -exec ... +, rather than spawning grep for each file.
UPPER_CASE_VARIABLES are shouting at me, and by convention upper-case variables are reserved for exported variables.
myarray=('Variable One' 'Variable Two' 'Variable Three')
arg=$(printf "%s\|" "${myarray[@]}" | sed 's/\\|$//')
directory=.
count_myarray=$(find "$directory" -type f -mtime -1 -exec grep -l "$arg" {} + | wc -l)
Alternatively, you can pass multiple -exec arguments to find. First construct the find arguments from myarray, each in the form -exec grep -l <the var> {} +. Note that multiple variables can appear in the same file, so get the unique filenames after grepping.
myarray=('Variable One' 'Variable Two' 'Variable Three');
findargs=()
for i in "${MYARRAY[#]}"; do
findargs+=(-exec grep -l "$i" {} +)
done
directory=.
count_myarray=$(find "$directory" -type f -mtime -1 "${findargs[#]}" | sort -u | wc -l)
or similar:
count_myarray=$(printf '-exec\0grep\0-l\0%s\0{}\0+\0' "${myarray[@]}" | xargs -0 find "$directory" -type f -mtime -1 | sort -u | wc -l)
Remember to quote your variable expansions to protect against whitespaces or special characters in filenames and directory names.
Going wrong:
With echo $MYARRAY you get only Variable One (the first array element), not the pattern you want for grep.
Also note that it is better to use lowercase for your variable names. I will use ${directory} and not $DIRECTORY (and in double quotes, for directories with a space).
You have more options with grep. When you want a file with 8 occurrences counted once, you cannot use the grep option -c. A useful option is -r. You are looking for something like
grep -Erl "Variable One|Variable Two|Variable Three" "${directory}" | wc -l
This gets difficult when the variables might contain special characters like $ or |.
Another option of grep is using the option
-f FILE, Obtain patterns from FILE, one per line
So you should make a function that writes the variables to a file, and use something like
grep -rlFf "myVariablesFile" "${directory}" | wc -l
When the content of the file is changing rapidly, you might want to avoid the temporary file with
grep -rlFf <(function_that_writes_variables_to_stdout) "${directory}"| wc -l
or directly
grep -rlFf <(printf "%s\n" "${var1}" "${var2}" "${var3}") "${directory}" | wc -l
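Putting that together with the array from the question (just a sketch; myarray and directory are as declared earlier, and the -mtime -1 restriction from the original find is dropped here):
count_myarray=$(grep -rlFf <(printf '%s\n' "${myarray[@]}") "${directory}" | wc -l)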

How to remove numbers from extensions from files

I have many files in a directory having extension like
.text(2) and .text(1).
I want to remove the numbers from extension and output should be like
.text and .text .
Can anyone please help me with the shell script for that?
I am using CentOS.
A pretty portable way of doing it would be this:
for i in *.text*; do mv "$i" "$(echo "$i" | sed 's/([0-9]\{1,\})$//')"; done
Loop through all files which end in .text followed by anything. Use sed to remove any parentheses containing one or more digits from the end of each filename.
If all of the numbers within the parentheses are single digits and you're using bash, you could also use built-in parameter expansion:
for i in *.text*; do mv "$i" "${i%([0-9])}"; done
The expansion removes any parentheses containing a single digit from the end of each filename.
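If the numbers may have more than one digit, a variant that stays in pure bash is to use regex matching instead of pattern removal (a rough sketch):
for i in *.text*; do
  # strip a trailing (digits) suffix; BASH_REMATCH[1] is the name without it
  [[ $i =~ ^(.*)\([0-9]+\)$ ]] && mv "$i" "${BASH_REMATCH[1]}"
done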
Another way without loops, but also with sed (and all the regexp's inside) is piping to sh:
ls *text* | sed 's/\(.*\)\..*/mv \1* \1.text/' | sh
Example:
[...]$ ls
xxxx.text(1) yyyy.text(2)
[...]$ ls *text* | sed 's/\(.*\)\..*/mv \1* \1.text/' | sh
[...]$ ls
xxxx.text yyyy.text
Explanation:
Everything between \( and \) is stored and can be pasted again with \1 (or \2, \3, ... a consecutive number for each pair of parentheses used). Therefore, the code above stores all the characters before the last dot \. and then builds a sequence of commands like this:
mv xxxx* xxxx.text
mv yyyy* yyyy.text
That is piped to sh
The simplest way, if the files are in the same folder and the Perl-based rename is installed (the default rename on CentOS comes from util-linux and uses a different syntax):
rename 's/text\([0-9]+\)/text/' *.text*

Faster grep function for big (27GB) files

I have a file (5 MB) containing specific strings, and I have to grep for those same strings (and other information) in a big file (27 GB).
To speed up the analysis I split the 27 GB file into 1 GB files and then applied the following script (with the help of some people here). However, it is not very efficient: producing a 180 KB file takes 30 hours!
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
  for x in `echo {a..z}` ;
  do
    for y in `echo {a..z}` ;
    do
      for ids in $(cat input.sam | awk '{print $1}');
      do
        grep $ids sample_"$z""$x""$y" | awk '{print $1" "$10" "$11}' >> output.txt &
        let count+=1
        [[ $((count%NR_CPUS)) -eq 0 ]] && wait
      done
    done
  done #&
done
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
  for x in {a..z}
  do
    for y in {a..z}
    do
      LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
    done >> output.txt
  done
done
Also, check out GNU Parallel which is designed to help you run jobs in parallel.
My initial thought is that you're repeatedly spawning grep. Spawning processes is very expensive (relatively), and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require the continual process creation.
E.g. for each inner loop you're kicking off cat and awk (you won't need cat, since awk can read files, and in fact doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for 4 greps to finish and go around again.
If you have to use grep, you can use
grep -f filename
to specify the set of patterns to match in that file, rather than a single pattern on the command line. I suspect from the above that you can pre-generate such a list.
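For example, something along these lines (a sketch that mirrors the script in the other answer; sample_aaa stands in for any one of the split files):
awk '{print $1}' input.sam > idsFile.txt
LC_ALL=C grep -Ff idsFile.txt sample_aaa | awk '{print $1, $10, $11}'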
OK, I have a test file containing 4-character strings, i.e. aaaa, aaab, aaac, etc.
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the mmap option shows a clear improvement on a 2 GB file with two search patterns. If you take @BrianAgnew's advice and use a single invocation of grep, try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source file changes during the search.
from man grep
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
Using GNU Parallel it would look like this:
awk '{print $1}' input.sam > idsFile.txt
doit() {
LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

BASH: Send output of complex command to an array

NOTE FROM OP: Oops. My mistake. I happened to let grep hunt for something(s) non-existent. Of course I got no output. And yes, this is a dup of another question.
<><><><><><><><><><><><><><><><><><><><>
There are many answers on the web to (most of) this question. The "most of" part is my problem.
How do I capture the output of a command line into a bash array when the command line contains pipe chars, "|"?
array=($(ps -ef | grep myproc | grep -v grep))
doesn't work. Neither does:
array=(`ps -ef | grep myproc | grep -v grep`)
(those are backquotes in case your font mangles them).
And, can the given answer be used with array+= syntax?
array=($(ps -ef | grep myproc | grep -v grep))
works perfectly well. You can verify it by printing the number of elements in your array:
echo ${#array[*]}
or the complete array with
echo ${array[*]}
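And yes, the += syntax works the same way with command substitution if you want to append further results. A quick sketch (the second pipeline, with the name otherproc, is just a made-up example):
array=($(ps -ef | grep myproc | grep -v grep))
array+=($(ps -ef | grep otherproc | grep -v grep))
echo ${#array[*]}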

script for getting extensions of a file

I need to get all the file extension types in a folder. For instance, if the directory's ls gives the following:
a.t
b.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
I should get this by running the script
.t
.t.pg
.bin
.old
.txt
I have a bash shell.
Thanks a lot!
See the BashFAQ entry on ParsingLS for a description of why many of these answers are evil.
The following approach avoids this pitfall (and, by the way, completely ignores files with no extension):
shopt -s nullglob
for f in *.*; do
  printf '%s\n' ".${f#*.}"
done | sort -u
Among the advantages:
Correctness: ls behaves inconsistently and can result in inappropriate results. See the link at the top.
Efficiency: Minimizes the number of subprocesses invoked (only one, sort -u, and even that could be removed if we used Bash 4's associative arrays to store results).
Things that still could be improved:
Correctness: this will correctly discard newlines in filenames before the first . (which some other answers won't) -- but filenames with newlines after the first . will be treated as separate entries by sort. This could be fixed by using nulls as the delimiter, or by the aforementioned bash 4 associative-array storage approach.
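That associative-array variant might look something like this (a sketch, requires bash 4+; because it never pipes filenames through another tool, newlines anywhere in a name are not an issue):
shopt -s nullglob
declare -A extensions
for f in *.*; do
  extensions[".${f#*.}"]=1   # one key per distinct extension
done
printf '%s\n' "${!extensions[@]}"
Note that the output order is unspecified; pipe it through sort if you want it ordered.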
try this:
ls -1 | sed 's/^[^.]*\(\..*\)$/\1/' | sort -u
ls lists files in your folder, one file per line
sed magic extracts extensions
sort -u sorts extensions and removes duplicates
sed magic reads as:
s/ / /: substitutes whatever is between first and second / by whatever is between second and third /
^: match beginning of line
[^.]: match any character that is not a dot
*: match it as many times as possible
\( and \): remember whatever is matched between these two parentheses
\.: match a dot
.: match any character
*: match it as many times as possible
$: match end of line
\1: this is what has been matched between parentheses
People are really over-complicating this - particularly the regex:
ls | grep -o "\..*" | uniq
ls - get all the files
grep -o "\..*" - -o only show the match; "\..*" match at the first "." & everything after it
uniq - don't print adjacent duplicates but keep the same order
you can also sort if you like, but sorting doesn't match the example
This is what happens when you run it:
> ls -1
a.t
a.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
> ls | grep -o "\..*" | uniq
.t
.t.pg
.bin
.old
.txt
