Counting words and delete strings from a text file in unix - file

I have a question for you: I have a big log file and I want to clean it. I'm interested only in strings which contain determinate word and I want to delete the other strings. i.e.:
access ok from place1
access ko from place1
access ok from place2
access ko from place2
access ok from place3
access ko from place3
......
And I want to obtain only the 'place2' entry:
access ok from place2
access ko from place2
How can I do it?
Thanks in advance!

grep "place2" /path/to/log/file > cleanedFile.txt
I wrote a blog post about combining find/sed/grep - you might be interested.

Try this grep command:
grep "\<place2\>" log-file > out-file
\< and \> will make sure to match full word thus inplace2 will NOT be matched.

grep "\<place2\>" file.log > file.out
wc file.out
wc (word count) for counting the words. But for 2 questions, you should normally open two questions. :)

Another take, select lines where the 4th column equals "place2"
awk '$4 == "place2"' file

Unlike most other answers, this modifies the file in-place and does not need further renaming.
sed -i -n '/place2/p' /var/log/file
This assumes GNU sed. If you don't have GNU sed but have perl:
perl -i -ne '/place2/ && print' /var/log/file

These 2 examples does in-place editing as well.
$ awk '$NF=="place2"{print $0>FILENAME}' file
$ ruby -i.bak -ane 'print if $F[-1]=="place2"' file
There are other ways to files these lines
sed -i.bak -n '/place2$/p' file
grep 'place2$' file > temp && mv temp file
Purely using the shell
while read -r line; do case $line in *place2) echo "$line";; esac; done < file > temp && mv temp file

Related

Shell Script regex matches to array and process each array element

While I've handled this task in other languages easily, I'm at a loss for which commands to use when Shell Scripting (CentOS/BASH)
I have some regex that provides many matches in a file I've read to a variable, and would like to take the regex matches to an array to loop over and process each entry.
Regex I typically use https://regexr.com/ to form my capture groups, and throw that to JS/Python/Go to get an array and loop - but in Shell Scripting, not sure what I can use.
So far I've played with "sed" to find all matches and replace, but don't know if it's capable of returning an array to loop from matches.
Take regex, run on file, get array back. I would love some help with Shell Scripting for this task.
EDIT:
Based on comments, put this together (not working via shellcheck.net):
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=($(sed 'asset\((.*)\)' $examplefile))
for el in ${!examplearr[*]}
do
echo "${examplearr[$el]}"
done
This works in bash on a mac:
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=(`echo "$examplefile" | sed -e '/.*/s/asset(\(.*\))/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
output:
'1a/1b/1c.ext'
'2a/2b/2c.ext'
'3a/3b/3c.ext'
Note the wrapping of $examplefile in quotes, and the use of sed to replace the entire line with the match. If there will be other content in the file, either on the same lines as the "asset" string or in other lines with no assets at all you can refine it like this:
#!/bin/sh
examplefile="
fooasset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')bar
foobar
fooasset('3a/3b/3c.ext')bar
"
examplearr=(`echo "$examplefile" | grep asset | sed -e '/.*/s/^.*asset(\(.*\)).*$/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
and achieve the same result.
There are several ways to do this. I'd do with GNU grep with perl-compatible regex (ah, delightful line noise):
mapfile -t examplearr < <(grep -oP '(?<=[(]).*?(?=[)])' <<<"$examplefile")
for i in "${!examplearr[#]}"; do printf "%d\t%s\n" $i "${examplearr[i]}"; done
0 '1a/1b/1c.ext'
1 '2a/2b/2c.ext'
2 '3a/3b/3c.ext'
This uses the bash mapfile command to read lines from stdin and assign them to an array.
The bits you're missing from the sed command:
$examplefile is text, not a filename, so you have to send to to sed's stdin
sed's a funny little language with 1-character commands: you've given it the "a" command, which is inappropriate in this case.
you only want to output the captured parts of the matches, not every line, so you need the -n option, and you need to print somewhere: the p flag in s///p means "print the [line] if a substitution was made".
sed -n 's/asset\(([^)]*)\)/\1/p' <<<"$examplefile"
# or
echo "$examplefile" | sed -n 's/asset\(([^)]*)\)/\1/p'
Note that this returns values like ('1a/1b/1c.ext') -- with the parentheses. If you don't want them, add the -r or -E option to sed: among other things, that flips the meaning of ( and \(

How to edit a file with shell scripting

I have a file containing thousands of lines like this:
0x7f29139ec6b3: W 0x7fff06bbf0a8
0x7f29139f0010: W 0x7fff06bbf0a0
0x7f29139f0014: W 0x7fff06bbf098
0x7f29139f0016: W 0x7fff06bbf090
0x7f29139f0036: R 0x7f2913c0db80
I want to make a new file which contains only the second hex number on each line (the part marked in bold above)
I have to put all these hex numbers in an array in a C program. So I am trying to make a file with only the hex numbers on the right hand side, so that my C program can use the fscanf function to read these numbers from the modified file.
I guess we can use some shell script to make a file containing those hex numbers? grep or something?
You can use sed and edit inplace. For matching "R" or any other char use
sed -i "s/.*:..//g" file
cat file
0x7fff06bbf0a8
0x7fff06bbf0a0
0x7fff06bbf098
0x7fff06bbf090
You can use grep -oP command:
grep -oP ' \K0x[a-fA-F0-9]*' file
0x7fff06bbf0a8
0x7fff06bbf0a0
0x7fff06bbf098
0x7fff06bbf090
0x7f2913c0db80
You can run a command on the file that will create a new file in the format you want: somecommand <oldfile >newfile. That will leave the original file intact and create a new one for you to feed to your C program.
As to what somecommand should be, you have multiple options. The easiest is probably awk:
awk '{print $NF}'
But you can also do it with sed or grep or perl or cut ... see other answers for an embarrassment of choices.
Since it seems that you always want to select the third field, the simplest approach is to use cut:
cut -d ' ' -f 3 file
or awk:
awk '{print $3}' file

How to store elements with whitespace in an array?

Just wondering, assuming I am storing my data in a file called BookDB.txt in the following format :
C++ for dummies:Jared:10.52:5:6
Java for dummies:David:10.65:4:6
whereby each field is seperated by the delimeter ":".
How would I preserve whitespace in the first field and have an array with the following contents : ('C++ for dummies' 'Java for dummies')?
Any help is very much appreciated!
Ploutox's solution is almost correct, but without setting IFS, you will not get the array that you seek, with two elements in this case.
Note: He corrected his solution after this post.
IFS=$'\n': arr=( $(awk -F':' '{print $1 }' Input.txt ) )
echo ${#arr[#]}
echo ${arr[0]}
echo ${arr[1]}
Output:
2
C++ for dummies
Java for dummies
Just use a while loop:
#!/bin/bash
# create and populate the array
a=()
while IFS=':' read -r field _
do
a+=("$field")
done < file
# print the array contents
printf "%s\n" "${a[#]}"
I totally misunderstood your question on my 1st attempt to answer. awk seems more suited for your need though. You can get what you want with simple scripting :
IFS=$'\n' : MYARRAY=($(awk -F ":" '{print $1}' myfile))
the -F flag forces : as the field separator.
echo ${MYARRAY[0]} will print :
C++ for dummies
$ yes sed -i "s/:/\'\'/" BookDB.txt | head -n100 | bash
this command while work. this is a linux command, run it on shell in same path with BookDB.txt

Faster grep function for big (27GB) files

I have to grep from a file (5MB) containing specific strings the same strings (and other information) from a big file (27GB).
To speed up the analysis I split the 27GB file into 1GB files and then applied the following script (with the help of some people here). However it is not very efficient (to produce a 180KB file it takes 30 hours!).
Here's the script. Is there a more appropriate tool than grep? Or a more efficient way to use grep?
#!/bin/bash
NR_CPUS=4
count=0
for z in `echo {a..z}` ;
do
for x in `echo {a..z}` ;
do
for y in `echo {a..z}` ;
do
for ids in $(cat input.sam|awk '{print $1}');
do
grep $ids sample_"$z""$x""$y"|awk '{print $1" "$10" "$11}' >> output.txt &
let count+=1
[[ $((count%NR_CPUS)) -eq 0 ]] && wait
done
done #&
A few things you can try:
1) You are reading input.sam multiple times. It only needs to be read once before your first loop starts. Save the ids to a temporary file which will be read by grep.
2) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. This will speed up grep.
3) Use fgrep because you're searching for a fixed string, not a regular expression.
4) Use -f to make grep read patterns from a file, rather than using a loop.
5) Don't write to the output file from multiple processes as you may end up with lines interleaving and a corrupt file.
After making those changes, this is what your script would become:
awk '{print $1}' input.sam > idsFile.txt
for z in {a..z}
do
for x in {a..z}
do
for y in {a..z}
do
LC_ALL=C fgrep -f idsFile.txt sample_"$z""$x""$y" | awk '{print $1,$10,$11}'
done >> output.txt
Also, check out GNU Parallel which is designed to help you run jobs in parallel.
My initial thoughts are that you're repeatedly spawning grep. Spawning processes is very expensive (relatively) and I think you'd be better off with some sort of scripted solution (e.g. Perl) that doesn't require the continual process creation
e.g. for each inner loop you're kicking off cat and awk (you won't need cat since awk can read files, and in fact doesn't this cat/awk combination return the same thing each time?) and then grep. Then you wait for 4 greps to finish and you go around again.
If you have to use grep, you can use
grep -f filename
to specify the set of patterns to match in the filename, rather than a single pattern on the command line. I suspect form the above you can pre-generate such a list.
ok I have a test file containing 4 character strings ie aaaa aaab aaac etc
ls -lh test.txt
-rw-r--r-- 1 root pete 1.9G Jan 30 11:55 test.txt
time grep -e aaa -e bbb test.txt
<output>
real 0m19.250s
user 0m8.578s
sys 0m1.254s
time grep --mmap -e aaa -e bbb test.txt
<output>
real 0m18.087s
user 0m8.709s
sys 0m1.198s
So using the mmap option shows a clear improvement on a 2 GB file with two search patterns, if you take #BrianAgnew's advice and use a single invocation of grep try the --mmap option.
Though it should be noted that mmap can be a bit quirky if the source files changes during the search.
from man grep
--mmap
If possible, use the mmap(2) system call to read input, instead of the default read(2) system call. In some situations, --mmap yields better performance. However, --mmap can cause undefined behavior (including core dumps) if an input file shrinks while grep is operating, or if an I/O error occurs.
Using GNU Parallel it would look like this:
awk '{print $1}' input.sam > idsFile.txt
doit() {
LC_ALL=C fgrep -f idsFile.txt sample_"$1" | awk '{print $1,$10,$11}'
}
export -f doit
parallel doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt
If the order of the lines is not important this will be a bit faster:
parallel --line-buffer doit {1}{2}{3} ::: {a..z} ::: {a..z} ::: {a..z} > output.txt

How do I capture the output from the ls or find command to store all file names in an array?

Need to process files in current directory one at a time. I am looking for a way to take the output of ls or find and store the resulting value as elements of an array. This way I can manipulate the array elements as needed.
To answer your exact question, use the following:
arr=( $(find /path/to/toplevel/dir -type f) )
Example
$ find . -type f
./test1.txt
./test2.txt
./test3.txt
$ arr=( $(find . -type f) )
$ echo ${#arr[#]}
3
$ echo ${arr[#]}
./test1.txt ./test2.txt ./test3.txt
$ echo ${arr[0]}
./test1.txt
However, if you just want to process files one at a time, you can either use find's -exec option if the script is somewhat simple, or you can do a loop over what find returns like so:
while IFS= read -r -d $'\0' file; do
# stuff with "$file" here
done < <(find /path/to/toplevel/dir -type f -print0)
for i in `ls`; do echo $i; done;
can't get simpler than that!
edit: hmm - as per Dennis Williamson's comment, it seems you can!
edit 2: although the OP specifically asks how to parse the output of ls, I just wanted to point out that, as the commentators below have said, the correct answer is "you don't". Use for i in * or similar instead.
You actually don't need to use ls/find for files in current directory.
Just use a for loop:
for files in *; do
if [ -f "$files" ]; then
# do something
fi
done
And if you want to process hidden files too, you can set the relative option:
shopt -s dotglob
This last command works in bash only.
Depending on what you want to do, you could use xargs:
ls directory | xargs cp -v dir2
For example. xargs will act on each item returned.

Resources