bash: looping over the files with extra conditions - arrays

In the working directory there are several files, grouped by the suffix of the file name. Here is an example with 4 groups:
# group 1 has 5 files
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
NpXynWT_apo_300K_5.pdb
# group 2 has two files
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
# group 3 has 4 files
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4 has 1 file
NpXynWT_com_340K_1.pdb
I have written a simple bash workflow to:
1) pre-process each of the files via sed: add something to each file
2) cat together the pre-processed files that belong to the same group
Here is my script for this workflow, where I created an array with the names of the groups and looped over the file index from 1 to 5:
# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
# loop over the groups
for model in "${systems[@]}"; do
# loop over the files inside of each group
for i in {0001..0005}; do
# edit file via SED
sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_"$i"_FA.pdb
done
# after editing, cat the pre-processed files
cat "${pdbs}"/"${model}"_[1-5]_FA.pdb > "${output}/${model}.pdb"
done
The questions to improve this script:
1) How would it be possible to add some checking conditions within the inner loop (e.g. by means of an IF statement) to consider only existing files? In my example the script always loops over 5 files for each group, according to the maximum number of files in any group (here, the 5 files of the first group):
for i in {0001..0005}; do
I would rather loop over all of the existing files of the given group and break the loop if a file does not exist (e.g. considering the 4th group, with only 1 file). Here is an example, which however does not work properly:
# loop over the groups with the checking of the presence of the file
for model in "${systems[@]}"; do
i="0"
# loop over the files inside of each group
for i in {0001..9999}; do
if [ ! -f "${pdbs}/${model}_00${i}_FA.pdb" ]; then
echo "File ${pdbs}/${model}_00${i}_FA.pdb does not exist!"
break
else
# edit file via SED
sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_00"$i"_FA.pdb
i=$[$i+1]
fi
done
done
Would it be possible to loop over any number of existing files in the group (rather than restricting it to some given, very big number of files with
for i in {0001..9999}; do?

You can check if a file exists with the -f test, and break if it doesn't:
if [ ! -f "${pdbs}/${model}_${i}_FA.pdb" ]; then
break
fi
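Putting it together, the inner loop could look like this; a minimal sketch, assuming your files really are named ${model}_0001_FA.pdb, ${model}_0002_FA.pdb, and so on, to match the zero-padded loop variable (it still iterates at most up to 9999, but stops at the first missing file):
for model in "${systems[@]}"; do
    for i in {0001..9999}; do
        f="${pdbs}/${model}_${i}_FA.pdb"
        # stop at the first missing file of the group
        if [ ! -f "$f" ]; then
            break
        fi
        sed -i "1 i\This is $i file of the group" "$f"
    done
done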
Your existing cat command already picks up only the files that exist in each group, because with "${pdbs}"/"${model}"_[1-5]_FA.pdb bash is performing filename expansion, not simply expanding the [1-5] to all possible values. You can see this in the following example:
> touch f1 f2 f5 # files f3 and f4 do not exist
> echo f[1-5]
f1 f2 f5
Notice that f[1-5] did not expand to f1 f2 f3 f4 f5.
Update:
If you want your glob expression to match files ending in numbers bigger than 9, the [1-n] syntax will not work. The reason is that the [...] syntax defines a pattern that matches a single character. For instance, the expression foo[1-9] will match files foo1 through foo9, but not foo10 or foo99.
Doing something like foo[1-99] does not work, because it doesn't mean what you might think it means. The inside of the [] can contain any number of individual characters, or ranges of characters. For example, [1-9a-nxyz] would match any character from '1' through '9', from 'a' through 'n', or any of the characters 'x', 'y', or 'z', but it would not match '0', 'q', 'r', etc. Or for that matter, it would also not match any uppercase letters.
So [1-99] is not interpreted as the range of numbers from 1-99, it is interpreted as the set of characters comprised of the range from '1' to '9', plus the individual character '9'. Therefore the patterns [1-9] and [1-99] are equivalent, and will only match characters '1' through '9'. The second 9 in the latter expression is redundant.
However, you can still achieve what you want with extended globs, which you can enable with the command shopt -s extglob:
> touch f1 f2 f5 f99 f100000 f129828523
> echo f[1-99999999999] # Doesn't work like you want it to
f1 f2 f5
> shopt -s extglob
> echo f+([0-9])
f1 f2 f5 f99 f100000 f129828523
The +([0-9]) expression is an extended glob expression composed of two parts: the [0-9], whose meaning should be obvious at this point, and the enclosing +(...).
The +(pattern) syntax is an extglob expression that means match one or more instances of pattern. In this case, our pattern is [0-9], so the extglob expression +([0-9]) matches any string of digits 0-9.
However, you should note that this means it also matches things like 000000000. If you are only interested in numbers greater than or equal to 1, you would instead do (with extglob enabled):
> echo f[1-9]*([0-9])
Note the *(pattern) here instead of +(pattern). The * means match zero or more instances of pattern. Which we want because we've already matched the first digit with [1-9]. For instance, f[1-9]+([0-9]) does not match the filename f1.
You may not want to leave extglob enabled in your whole script, particularly if you have any regular glob expression elsewhere in your script that might accidentally be interpreted as an extglob expression. To disable extglob when you're done with it, do:
shopt -u extglob
There's one other important thing to note here. If a glob pattern doesn't match any files, then it is interpreted as a raw string, and is left unmodified.
For example:
> echo This_file_totally_does_not_exist*
This_file_totally_does_not_exist*
Or more to the point in your case, suppose there are zero files in your 4th case, e.g. there are no files containing NpXynWT_com_340K. In this case, if you try to use a glob containing NpXynWT_com_340K, you get the entire glob as a literal string:
> shopt -s extglob
> echo NpXynWT_com_340K_[1-9]*([0-9])
NpXynWT_com_340K_[1-9]*([0-9])
This is obviously not what you want, especially in the middle of your script where you are trying to cat the matching files. Luckily there is another option you can set to make non-matching globs expand to nothing:
> shopt -s nullglob
> echo This_file_totally_does_not_exist* # prints nothing
As with extglob, there may be unintended behavior elsewhere in your script if you leave nullglob on.
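Putting the pieces together, here is a sketch of the whole workflow that loops only over the files that actually exist, with no upper bound on the index. It assumes the ${model}_<number>_FA.pdb naming from your script; adjust the glob to your real file names:
#!/bin/bash
shopt -s extglob nullglob

systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)

for model in "${systems[@]}"; do
    # nullglob: this expands to an empty array if the group has no files
    files=( "${pdbs}/${model}"_[1-9]*([0-9])_FA.pdb )
    for f in "${files[@]}"; do
        sed -i "1 i\This is file ${f##*/} of the group" "$f"
    done
    # cat only when the group is non-empty
    [ "${#files[@]}" -gt 0 ] && cat "${files[@]}" > "${output}/${model}.pdb"
done

shopt -u extglob nullglob
One caveat: glob results sort lexicographically, so _10 sorts before _2 unless the numbers in the names are zero-padded.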

Related

sh - appending 0's to file names according to the max

I am trying to make a file sorter. In the current directory I have files named like this:
info-0.jpg
info-12.jpg
info-40.jpg
info-5.jpg
info-100.jpg
I want it to become
info-000.jpg
info-012.jpg
info-040.jpg
info-005.jpg
info-100.jpg
That is, append 0's so that the number of digits is equal to 3, because the max number was 100 and had 3 digits.
I would like to use cut and wc in a loop over each of the file names: if $1 is "info", then for i in $1-*.jpg, but how? Thanks
I did this to start, but I get a syntax error:
wcount=0
for i in $filename-*.jpg; do
wcount=$((echo $i | wc -c))
done
for f in info*.jpg ; do
    numPart=${f%.*}       # strip the extension;  dbg: echo numPart1=$numPart
    numPart=${numPart#*-} # strip the info- part; dbg: echo numPart2=$numPart
    newFilename="${f%-*}"-$(printf '%03d' "$numPart")."${f##*.}"
    echo /bin/mv "$f" "$newFilename"
done
The key is using printf with a format that forces the width to 3 digits and includes 0 padding: the printf '%03d' "$numPart" portion of the script.
Also, syntax like ${f%.*} belongs to a set of features offered by modern shells for removing parts of a variable's value: % means match (and destroy) the minimal match from the right side of the value, and ${numPart#*-} means match (and destroy) the minimal match from the left side of the value. There are also %% (maximal match from the right) and ## (maximal match from the left). Experiment with a variable on your command line to get comfortable with this.
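For example:
$ f=info-12.jpg
$ echo "${f%.*}"    # minimal match of '.*' removed from the right: info-12
$ echo "${f#*-}"    # minimal match of '*-' removed from the left: 12.jpg
$ echo "${f%%-*}"   # maximal match of '-*' removed from the right: info
$ echo "${f##*.}"   # maximal match of '*.' removed from the left: jpg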
Triple check the output of the mv loop in your environment, and only when you are sure all mv commands look correct, remove the echo in front of /bin/mv.
If you get an error message like Can't find /bin/mv, then enter type mv and replace /bin/ with whatever path is returned for mv.
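If you want the padding width derived from the largest number instead of hardcoded to 3, here is a sketch of that, under the assumption that every file matches info-<digits>.jpg:
# first pass: find the widest number
width=0
for f in info-*.jpg; do
    n=${f%.*}; n=${n#*-}
    [ ${#n} -gt $width ] && width=${#n}
done
# second pass: rename, zero-padding to that width
for f in info-*.jpg; do
    n=${f%.*}; n=${n#*-}
    # 10#$n forces base 10, so a pre-existing leading 0 isn't read as octal
    newFilename="${f%-*}-$(printf "%0${width}d" "$((10#$n))").${f##*.}"
    echo /bin/mv "$f" "$newFilename"
done
As above, remove the echo only once the mv commands look right.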
IHTH

Bash help tallying/parsing substrings

I have a shell script I wrote a while back, that reads a word list (HITLIST), and recursively searches a directory for all occurrences of those words. Each line containing a "hit" is appended to file (HITOUTPUT).
I have used this script a couple of times over the last year or so, and have noticed that we often get hits from frequent offenders, and that it would be nice if we kept a count of each "super-string" that is triggered, and automatically remove repeat offenders.
For instance, if my word list contains "for" I might get a hundred hits or so for "foreign" or "form" or "force". Instead of validating each of these lines, it would be nice to simply wipe them all with one "yes/no" dialog per super-string.
I was thinking the best way to do this would be to start with a word from the hitlist, and record each unique occurrence of the super-string for that word (go until you are book-ended by whitespace) and go from there.
So on to the questions ...
What would be a good and efficient way to do this? My current idea was to read in the file as a string, perform my counts, remove repeat offenders from the file input string, and output, but this is proving to be a little more painful than I first suspected.
Would any specific data type/structure be preferred for this type of
work?
I have also thought about building the super-string count as I
create the HitOutput file, but I could not figure out a clean way of
doing this either. Any thoughts or suggestions?
A sample of the file I am reading in, and my code for reading in and traversing the hitlist and creating the HitOutput file below:
# Loop through hitlist list
while read -re hitlist || [[ -n "$hitlist" ]]
do
# If first character is "#" it's a comment, or line is blank, skip
if [ "$(echo $hitlist | head -c 1)" != "#" ]; then
if [ ! -z "$hitlist" -a "$histlist" != "" ]; then
# Parse comma delimited hitlist
IFS=',' read -ra categoryWords <<< "$hitlist"
# Search for occurrences/hits for each hit
for categoryWord in "${categoryWords[@]}"; do
# Append results to hit output string
eval 'find "$DIR" -type f -print0 | xargs -0 grep -HniI "$categoryWord"' >> HITOUTPUT
done
fi
fi
done < "$HITLIST"
src/fakescript.sh:1:Never going to win the war you mother!
src/open_source_licenses.txt:6147:May you share freely, never taking more than you give.
src/open_source_licenses.txt:8764:May you share freely, never taking more than you give.
src/open_source_licenses.txt:21711:No Third Party Beneficiaries. You agree that, except as otherwise expressly provided in this TOS, there shall be no third party beneficiaries to this Agreement. Waiver and Severability of Terms. The failure of UBM LLC to exercise or enforce any right or provision of the TOS shall
not constitute a waiver of such right or provision. If any provision of the TOS is found by a court of competent jurisdiction to be invalid, the parties nevertheless agree that the court should endeavor to give effect to the parties' intentions as reflected in the provision, and the other provisions of the TOS remain in full force and effect.
src/fakescript.sh:1:Never going to win the war you mother!
An example of my hitlist file:
# Comment out any category word lines that you do not want processed (the comma delimited lines)
# -----------------
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother
Let's divide this problem into two parts. First, we will update the hitlist interactively as required by your customer. Second, we will find all matches to the updated hitlist.
1. Updating the hitlist
This searches for all words in files under directory dir that contain any word on the hitlist:
#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
sort |
uniq -c |
while read n word
do
read -u 2 -p "$word occurs $n times. Include (y/n)? " a
[ "$a" = y ] && echo "$word" >>hitlist
done
This script runs interactively. As an example, suppose that dir contains these two files:
$ cat dir/file1.txt
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
$ cat dir/file2.txt
dog and cat and formula, formula, formula
And hitlist contains two words:
$ cat hitlist
for
cat
If we then run our script, it looks like:
$ bash script.sh
catapult occurs 2 times. Include (y/n)? y
catermaran occurs 1 times. Include (y/n)? n
foreign occurs 2 times. Include (y/n)? y
form occurs 1 times. Include (y/n)? n
formula occurs 4 times. Include (y/n)? n
After the script is run, the file hitlist is updated with all the words that you want to include. We are now ready to proceed to the next step:
2. Finding matches to the updated hitlist
To read each word from a "hitlist" and search recursively for matches while ignoring foreign even if the hitlist contains for, try:
grep -wrFf ../hitlist dir
-w tells grep to look only for full-words. Thus foreign will be ignored.
-r tells grep to search recursively.
-F tells grep to treat the hitlist entries as fixed strings, not regular expressions. (optional)
-f ../hitlist tells grep to read words from the file ../hitlist.
Following on with the example above, we would have:
$ grep -wrFf ./hitlist dir
dir/file2.txt:dog and cat and formula, formula, formula
dir/file1.txt:for all foreign or catapult also cat.
dir/file1.txt:The catapult hit the catermaran.
dir/file1.txt:The form of a foreign formula
If we don't want the file names displayed, use the -h option:
$ grep -hwrFf ./hitlist dir
dog and cat and formula, formula, formula
for all foreign or catapult also cat.
The catapult hit the catermaran.
The form of a foreign formula
Automatic update for counts 10 or less
#!/bin/bash
grep -Erowhf <(sed -E 's/.*/([[:alpha:]]+&[[:alpha:]]*|[[:alpha:]]*&[[:alpha:]]+)/' hitlist) dir |
sort |
uniq -c |
while read n word
do
a=y
[ "$n" -gt 10 ] && read -u 2 -p "$word occurs $n times. Include (y/n)? " a
[ "$a" = y ] && echo "$word" >>hitlist
done
Reformatting the customer's hitlist
I see that your customer's hitlist has extra formatting, including comments, empty lines, and duplicated words. For example:
$ cat hitlist.source
# MEH
never,going,to give,you up
# ----------------
# blah
word to,your,mother
To convert that to format useful here, try:
$ sed -E 's/#.*//; s/[[:space:],]+/\n/g; s/\n\n+/\n/g; /^$/d' hitlist.source | grep . | sort -u >hitlist
$ cat hitlist
give
going
mother
never
to
up
word
you
your

Bash - Concatenating backslash while joining an array

I've been trying to figure out a bash script to determine the server directory path, such as D:\xampp\htdocs, and the project folder's name, such as "my_project", while Grunt is running my postinstall script. So far I can grab the project folder name, and I can get an array of the remaining indices that comprise the server root path on my system, but I can't seem to join the array with an escaped backslash. This is probably not the best solution (definitely not the most elegant), so if you have any tips or suggestions along the way I'm amenable.
# Determine project folder name and server root directory path
bashFilePath=$0 # get path to post_install.sh
IFS='\' bashFilePathArray=($bashFilePath) # split path on \
len=${#bashFilePathArray[@]} # get array length
# Name of project folder in server root directory
projName=${bashFilePathArray[len-3]} # returns my_project
ndx=0
serverPath=""
while [ $ndx -le `expr $len - 4` ]
do
serverPath+="${bashFilePathArray[$ndx]}\\" # tried in and out of double quotes, also in separate concat below
(( ndx++ ))
done
echo $serverPath # returns D: xampp htdocs, works if you sub out \\ for anything else, such as / will produce D:/xampp/htdocs, just not \\
You can only prefix command invocations, not variable assignments, with IFS, so your line
IFS='\' bashFilePathArray=($bashFilePath)
is just a pair of assignments; the expansion of $bashFilePath is unaffected by the assignment to IFS. Instead, use the read builtin.
IFS='\' read -ra bashFilePathArray <<< "$bashFilePath"
Later, you can use a subshell to easily join the first few elements of the array into a single string.
serverPath=$(IFS='\'; echo "${bashFilePathArray[*]:0:len-3}")
The semi-colon is required, since the argument to echo is expanded before echo actually runs, meaning IFS needs to be modified "globally" rather than just for the echo command. Also, [*] is required in place of the more commonly recommended [@] because here we are making explicit use of the property that such an array expansion produces a single word rather than a sequence of words.
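A quick end-to-end demonstration, using a hypothetical path (in the real script it would come from $0):
bashFilePath='D:\xampp\htdocs\my_project\scripts\post_install.sh'
IFS='\' read -ra bashFilePathArray <<< "$bashFilePath"
len=${#bashFilePathArray[@]}
projName=${bashFilePathArray[len-3]}
serverPath=$(IFS='\'; echo "${bashFilePathArray[*]:0:len-3}")
echo "$projName"    # my_project
echo "$serverPath"  # D:\xampp\htdocs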

Bash substring expansion on array

I have a set of files with a given suffix. For instance, I have a set of pdf files with suffix .pdf. I would like to obtain the names of the files without the suffix using substring expansion.
For a single file I can use:
file="test.pdf"
echo ${file:0: -4}
To do this operation for all files, I now tried:
files=( $(ls *.pdf) )
ff=( "${files[#]:0: -4}" )
echo ${ff[#]}
I now get an error saying substring expression < 0.
(I would like to avoid using a for loop.)
Use parameter expansions to remove the .pdf part like so:
shopt -s nullglob
files=( *.pdf )
echo "${files[#]%.pdf}"
The shopt -s nullglob is always a good idea when using globs: it will make the glob expand to nothing if there are no matches.
"${files[#]%.pdf}" will expand to an array with all the trailing .pdf removed. You can, if you wish put this in another array as so:
files_noext=( "${files[@]%.pdf}" )
All this is 100% safe regarding funny symbols in filenames (spaces, newlines, etc.), except for the echo part for files named -n.pdf, -e.pdf and -E.pdf... but the echo was just here for demonstration purposes. Your files=( $(ls *.pdf) ) is really, really bad! Never parse the output of ls.
To answer your comment: substring expansions don't work on each field of the array. Quoting the bash reference manual:
${parameter:offset}
${parameter:offset:length}
If offset evaluates to a number less than zero, the value is used as an offset from the end of the value of parameter. If length evaluates to a number less than zero, and parameter is not @ or * and not an indexed or associative array, it is interpreted as an offset from the end of the value of parameter rather than a number of characters, and the expansion is the characters between the two offsets. If parameter is @ or *, the result is length positional parameters beginning at offset. If parameter is an indexed array name subscripted by @ or *, the result is the length members of the array beginning with ${parameter[offset]}. A negative offset is taken relative to one greater than the maximum index of the specified array. Substring expansion applied to an associative array produces undefined results.
So, e.g.,
$ array=( zero one two three four five six seven eight )
$ echo "${array[#]:3:2}"
three four
$

Sorting by unique values of multiple fields in UNIX shell script

I am new to unix and would like to be able to do the following but am unsure how.
Take a text file with lines like:
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
And output this:
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
I would like the script to be able to find all the lines for each TR value that have a unique Line value.
Thanks
Since you are apparently O.K. with randomly choosing among the values for dir, day, TI, and stn, you can write:
sort -u -t ';' -k 1,1 -k 6,6 -s < input_file > output_file
Explanation:
The sort utility, "sort lines of text files", lets you sort/compare/merge lines from files. (See the GNU Coreutils documentation.)
The -u or --unique option, "output only the first of an equal run", tells sort that if two input-lines are equal, then you only want one of them.
The -k POS1[,POS2] or --key=POS1[,POS2] option, "start a key at POS1 (origin 1), end it at POS2 (default end of line)", tells sort where the "keys" are that we want to sort by. In our case, -k 1,1 means that one key consists of the first field (from field 1 through field 1), and -k 6,6 means that one key consists of the sixth field (from field 6 through field 6).
The -t SEP or --field-separator=SEP option tells sort that we want to use SEP — in our case, ';' — to separate and count fields. (Otherwise, it would think that fields are separated by whitespace, and in our case, it would treat the entire line as a single field.)
The -s or --stable option, "stabilize sort by disabling last-resort comparison", tells sort that we only want to compare lines in the way that we've specified; if two lines have the same above-defined "keys", then they're considered equivalent, even if they differ in other respects. Since we're using -u, that means that one of them will be discarded. (If we weren't using -u, it would just mean that sort wouldn't reorder them with respect to each other.)
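For the sample input above, the whole run looks like this. Note that sort reorders the lines by key, so the TR groups come out in sorted order rather than in their original order:
$ sort -u -t ';' -k 1,1 -k 6,6 -s input_file
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P234;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=lowell
TR=P567;dir=o;day=su;TI=12:10;stn=westborough;Line=worcester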
