Bash substring expansion on array - arrays

I have a set of files with a given suffix. For instance, I have a set of pdf files with suffix .pdf. I would like to obtain the names of the files without the suffix using substring expansion.
For a single file I can use:
file="test.pdf"
echo ${file:0:-4}
To do this operation for all files, I now tried:
files=( $(ls *.pdf) )
ff=( "${files[@]:0: -4}" )
echo ${ff[@]}
I now get an error saying that substring expression < 0..
( I would like to avoid using a for loop )

Use parameter expansions to remove the .pdf part like so:
shopt -s nullglob
files=( *.pdf )
echo "${files[@]%.pdf}"
The shopt -s nullglob is always a good idea when using globs: it will make the glob expand to nothing if there are no matches.
"${files[@]%.pdf}" will expand to a list with the trailing .pdf removed from each element. You can, if you wish, put this in another array like so:
files_noext=( "${files[@]%.pdf}" )
All this is 100% safe regarding funny symbols in filenames (spaces, newlines, etc.), except for the echo part for files named -n.pdf, -e.pdf and -E.pdf... but the echo was just here for demonstration purposes. Your files=( $(ls *.pdf) ) is really, really bad! Never parse the output of ls.
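As a minimal sketch of the approach above, the following uses a hard-coded list of hypothetical file names (instead of a real glob, so it runs anywhere) and prints the result with printf rather than echo, which sidesteps the -n.pdf problem:

```shell
#!/usr/bin/env bash
# Hypothetical file names standing in for the result of files=( *.pdf ).
files=( "report.pdf" "my notes.pdf" "-n.pdf" )

# Strip the trailing .pdf from every element in a single expansion.
files_noext=( "${files[@]%.pdf}" )

# printf treats each name as data, so a name like -n is printed literally.
printf '%s\n' "${files_noext[@]}"
```

Each element stays a separate word, so the name containing a space survives intact.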
To answer your comment: substring expansions don't work on each field of the array. Taken from the reference manual linked above:
${parameter:offset}
${parameter:offset:length}
If offset evaluates to a number less than zero, the value is used as an offset from the end of the value of parameter. If length evaluates to a number less than zero, and parameter is not @ and not an indexed or associative array, it is interpreted as an offset from the end of the value of parameter rather than a number of characters, and the expansion is the characters between the two offsets. If parameter is @, the result is length positional parameters beginning at offset. If parameter is an indexed array name subscripted by @ or *, the result is the length members of the array beginning with ${parameter[offset]}. A negative offset is taken relative to one greater than the maximum index of the specified array. Substring expansion applied to an associative array produces undefined results.
So, e.g.,
$ array=( zero one two three four five six seven eight )
$ echo "${array[@]:3:2}"
three four
$
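To make the contrast concrete, here is a small sketch (the names array is made up for illustration). With [@], the offset and length select array elements, whereas a pattern-removal operator such as % is applied to each element separately, so %???? can stand in for a per-element "drop the last four characters":

```shell
#!/usr/bin/env bash
array=( zero one two three four five six seven eight )

# Offset and length count array elements, not characters.
slice=( "${array[@]:3:2}" )

# A negative offset counts from the end of the array (note the space before -2).
last_two=( "${array[@]: -2}" )

# Pattern removal works per element: ???? matches any four trailing characters.
names=( alpha.pdf "beta gamma.pdf" )
trimmed=( "${names[@]%????}" )

printf '%s\n' "${slice[@]}" "${last_two[@]}" "${trimmed[@]}"
```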

Related

Splitting strings in nested loops and iterating through results bash [duplicate]

I try to list subfolders and save them as list to a variable
DIRS=($(find . -maxdepth 1 -mindepth 1 -type d ))
After that I want to do some work in subfolders. i.e.
for item in $DIRS
do
echo $item
But after
echo $DIRS
it gives only first item (subfolder). Could someone point me to error or propose another solution?
The following creates a bash array but lists only one of three subdirectories:
$ dirs=($(find . -maxdepth 1 -mindepth 1 -type d ))
$ echo $dirs
./subdir2
To see all the directories, you must use the [@] or [*] subscript form:
$ echo "${dirs[@]}"
./subdir2 ./subdir1 ./subdir3
Or, using it in a loop:
$ for item in "${dirs[@]}"; do echo $item; done
./subdir2
./subdir1
./subdir3
Avoiding problems from word splitting
Note that, in the code above, the shell performs word splitting before the array is created. Thus, this approach will fail if any subdirectories have whitespace in their names.
The following will successfully create an array of subdirectory names even if the names have spaces, tabs or newlines in them:
dirs=(*/)
If you need to use find and you want it to be safe for difficult file names, then use find's -exec option.
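One way to get find results into an array safely is sketched below. It swaps -exec for -print0 combined with a read loop, which is a different technique from the one named above; -print0 is not in POSIX but is assumed to be available (GNU and BSD find both provide it). The temporary directory and its contents exist only for the demonstration:

```shell
#!/usr/bin/env bash
# Throwaway directory tree for the demonstration (hypothetical names).
tmp=$(mktemp -d)
mkdir "$tmp/sub dir" "$tmp/plain"

dirs=()
# find separates names with NUL bytes; read -d '' consumes one name at a time,
# so spaces, tabs and newlines inside a name are preserved.
while IFS= read -r -d '' d; do
  dirs+=( "$d" )
done < <(find "$tmp" -mindepth 1 -maxdepth 1 -type d -print0)

printf '<%s>\n' "${dirs[@]}"
rm -rf "$tmp"
```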
Documentation
The form $dirs returns just the first element of the array. This is documented in man bash:
Referencing an array variable without a subscript is equivalent to referencing the array with a subscript of 0.
The use of [@] and [*] is also documented in man bash:
Any element of an array may be referenced using ${name[subscript]}. The braces are required to avoid conflicts with pathname expansion. If subscript is @ or *, the word expands to all members of name. These subscripts differ only when the word appears within double quotes. If the word is double-quoted, ${name[*]} expands to a single word with the value of each array member separated by the first character of the IFS special variable, and ${name[@]} expands each element of name to a separate word.
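The documented behaviour can be checked with a short sketch. The helper count_args is a hypothetical name introduced here only to count how many words an expansion produces:

```shell
#!/usr/bin/env bash
dirs=( "./sub dir" ./subdir1 ./subdir3 )

# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

first=$dirs                                # no subscript: same as ${dirs[0]}
n_at=$(count_args "${dirs[@]}")            # one word per element
n_star=$(count_args "${dirs[*]}")          # all elements joined into one word
joined=$(IFS=,; printf '%s' "${dirs[*]}")  # joined with the first character of IFS

printf '%s\n' "$first" "$n_at" "$n_star" "$joined"
```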

Why glob or braces expansion from a variable to an array is impossible when containing spaces [duplicate]

I'm trying to use internal bash globs and braces expansion mechanism from a variable to an array.
path='./tmp2/tmp23/*'
expanded=($(eval echo "$(printf "%q" "${path}")"))
results:
declare -- path="./tmp2/tmp23/*"
declare -a expanded=([0]="./tmp2/tmp23/testfile" [1]="./tmp2/tmp23/testfile2" [2]="./tmp2/tmp23/testfile3" [3]="./tmp2/tmp23/testfile4" [4]="./tmp2/tmp23/tmp231")
This is working.
(I have 4 file testfileX and 1 folder in the ./tmp2/tmp23 folder)
Each file/folder inside an index of the array.
Now if my path contains spaces:
path='./tmp2/tmp2 3/*'
expanded=($(eval echo "$(printf "%q" "${path}")"))
Results
declare -- path="./tmp2/tmp2 3/*"
declare -a expanded=([0]="./tmp2/tmp2" [1]="3/")
Not working: nothing is expanded, and the path is split because of IFS.
Now with same path containing spaces but without glob:
path='./tmp2/tmp2 3/'
expanded=($(eval echo "$(printf "%q" "${path}"*)")) # glob added here, outside the quotes
Results:
declare -a expanded=([0]="./tmp2/tmp2" [1]="3/testfile./tmp2/tmp2" [2]="3/testfile2./tmp2/tmp2" [3]="3/testfile3./tmp2/tmp2" [4]="3/testfile4./tmp2/tmp2" [5]="3/tmp231")
The path is expanded, but the results are wrong and split because of IFS.
Now with a quoted $(eval)
expanded=("$(eval echo "$(printf "%q" "${path}"*)")")
Results:
declare -a expanded=([0]="./tmp2/tmp2 3/testfile./tmp2/tmp2 3/testfile2./tmp2/tmp2 3/testfile3./tmp2/tmp2 3/testfile4./tmp2/tmp2 3/tmp231")
Now the whole expansion ends up in a single array element.
Why does glob or brace expansion work inside a variable if there is no space?
Why does it stop working when there is a space? It is exactly the same code, just with a space. Globs and brace expansions need to be outside double quotes, and eval seems to have no effect.
Is there any alternative (such as read or mapfile), or is it possible to escape the space character?
I found the question how-to-assign-a-glob-expression-to-a-variable-in-a-bash-script, but it says nothing about spaces.
Is there any way to expand a variable containing glob or brace expansion patterns, with or without spaces, into an array using the same method and without word splitting?
Kind Regards
Don't use eval. Don't use a subshell. Just clear IFS.
path='./tmp2/tmp2 3/*'
oIFS=${IFS:-$' \t\n'} IFS='' # backup prior IFS value
expanded=( $path ) # perform expansion, unquoted
IFS=$oIFS # reset to original value, or an equivalent thereto
When you perform an unquoted expansion, two separate things happen in order:
All the characters found in the $IFS variable are used to split the string into words
Each word is then expanded as a separate glob.
The default value of IFS contains the space, the tab and the newline. If you don't want spaces, tabs and newlines to be treated as delimiters between words, then you need to modify that default.
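The save-and-restore steps can also be confined to a function, since a local IFS is restored automatically when the function returns. The following is only a sketch: expand_glob is a hypothetical helper name, and the temporary directory stands in for ./tmp2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expands the glob pattern in $1 into the global array "expanded".
expand_glob() {
  local IFS=          # empty IFS: no word splitting, but globbing still happens
  expanded=( $1 )     # deliberately unquoted so the pattern is expanded
}

# Throwaway directory containing a space, mirroring "tmp2 3" from the question.
tmp=$(mktemp -d)
mkdir -p "$tmp/tmp2 3"
touch "$tmp/tmp2 3/testfile" "$tmp/tmp2 3/testfile2"

path="$tmp/tmp2 3/*"
expand_glob "$path"
printf '<%s>\n' "${expanded[@]}"
rm -rf "$tmp"
```

Note that this handles globs only. Brace expansion is performed before parameter expansion, so braces stored in a variable are not expanded by this approach.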

bash: looping over the files with extra conditions

In the working directory there are several files grouped into several groups based on the end-suffix of the file name. Here is the example for 4 groups:
# group 1 has 5 files
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
NpXynWT_apo_300K_5.pdb
# group 2 has two files
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
# group 3 has 4 files
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4 has 1 file
NpXynWT_com_340K_1.pdb
I have written a simple bash workflow to
pre-process each file via sed: add something within each file
cat together the pre-processed files that belong to the same group
Here is my script implementing the workflow, where I created an array with the names of the groups and looped over it according to the file index from 1 to 5
# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
# loop over the groups
for model in "${systems[@]}"; do
# loop over the files inside of each group
for i in {0001..0005}; do
# edit file via SED
sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_"$i"_FA.pdb
done
# after editing, cat the pre-processed files
cat "${pdbs}"/"${model}"_[1-5]_FA.pdb > "${output}/${model}.pdb"
done
The questions to improve this script:
1) How would it be possible to add a check within the inner loop (e.g. by means of an if statement) so that only existing files are considered? In my example the script always loops over 5 files (for each group), according to the maximum number of files in any one group (here 5 files in the first group)
for i in {0001..0005}; do
I would rather loop over all of the existing files of the given group and break out of the loop if the file does not exist (e.g. considering the 4th group with only 1 file). Here is an example, which however does not work properly
# loop over the groups with the checking of the presence of the file
for model in "${systems[@]}"; do
i="0"
# loop over the files inside of each group
for i in {0001..9999}; do
if [ ! -f "${pdbs}/${model}_00${i}_FA.pdb" ]; then
echo 'File '${pdbs}/${model}_00${i}_FA.pdb' does not exits!'
break
else
# edit file via SED
sed -i "1 i\This is $i file of the group" "${pdbs}"/"${model}"_00"$i"_FA.pdb
i=$[$i+1]
fi
done
done
Would it be possible to loop over any number of existing files from the group (rather than restricting it to a given, e.g. very big, number of files with
for i in {0001..9999}; do)?
You can check if a file exists with the -f test, and break if it doesn't:
if [ ! -f "${pdbs}/${model}_${i}_FA.pdb" ]; then
break
fi
Your existing cat command already includes only the existing files in each group, because in "${pdbs}"/"${model}"_[1-5]_FA.pdb bash performs filename expansion rather than simply expanding [1-5] to all possible values. You can see this in the following example:
> touch f1 f2 f5 # files f3 and f4 do not exist
> echo f[1-5]
f1 f2 f5
Notice that f[1-5] did not expand to f1 f2 f3 f4 f5.
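Building on this, one possible way to visit only the files that exist is to let a glob drive the loop instead of a counter. The sketch below assumes the _N_FA.pdb naming used in the question's script and creates a throwaway directory with a few empty files; the counts array is introduced only to show what was found:

```shell
#!/usr/bin/env bash
# Throwaway directory with hypothetical files: 3 in one group, 1 in another, 0 in a third.
pdbs=$(mktemp -d)
touch "$pdbs"/NpXynWT_apo_300K_{1,2,3}_FA.pdb "$pdbs"/NpXynWT_com_340K_1_FA.pdb
systems=( NpXynWT_apo_300K NpXynWT_com_340K NpXynWT_apo_340K )

shopt -s nullglob            # a group without files yields an empty array
counts=()
for model in "${systems[@]}"; do
  group=( "$pdbs/${model}"_*_FA.pdb )
  counts+=( "${#group[@]}" )
  for f in "${group[@]}"; do
    :                        # per-file editing (e.g. the sed command) would go here
  done
done
shopt -u nullglob

printf '%s\n' "${counts[@]}"
rm -rf "$pdbs"
```

Because the inner loop iterates over the matches themselves, no existence test or break is needed.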
Update:
If you want your glob expression to match files ending in numbers bigger than 9, the [1-n] syntax will not work. The reason is that the [...] syntax defines a pattern that matches a single character. For instance, the expression foo[1-9] will match files foo1 through foo9, but not foo10 or foo99.
Doing something like foo[1-99] does not work, because it doesn't mean what you might think it means. The inside of the [] can contain any number of individual characters, or ranges of characters. For example, [1-9a-nxyz] would match any character from '1' through '9', from 'a' through 'n', or any of the characters 'x', 'y', or 'z', but it would not match '0', 'q', 'r', etc. Or for that matter, it would also not match any uppercase letters.
So [1-99] is not interpreted as the range of numbers from 1-99, it is interpreted as the set of characters comprised of the range from '1' to '9', plus the individual character '9'. Therefore the patterns [1-9] and [1-99] are equivalent, and will only match characters '1' through '9'. The second 9 in the latter expression is redundant.
However, you can still achieve what you want with extended globs, which you can enable with the command shopt -s extglob:
> touch f1 f2 f5 f99 f100000 f129828523
> echo f[1-99999999999] # Doesn't work like you want it to
f1 f2 f5
> shopt -s extglob
> echo f+([0-9])
f1 f2 f5 f99 f100000 f129828523
The +([0-9]) expression is an extended glob expression composed of two parts: the [0-9], whose meaning should be obvious at this point, and the enclosing +(...).
The +(pattern) syntax is an extglob expression that means match one or more instances of pattern. In this case, our pattern is [0-9], so the extglob expression +([0-9]) matches any string of digits 0-9.
However, you should note that this means it also matches things like 000000000. If you are only interested in numbers greater than or equal to 1, you would instead do (with extglob enabled):
> echo f[1-9]*([0-9])
Note the *(pattern) here instead of +(pattern). The * means match zero or more instances of pattern. Which we want because we've already matched the first digit with [1-9]. For instance, f[1-9]+([0-9]) does not match the filename f1.
You may not want to leave extglob enabled in your whole script, particularly if you have any regular glob expression elsewhere in your script that might accidentally be interpreted as an extglob expression. To disable extglob when you're done with it, do:
shopt -u extglob
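The two patterns can be compared side by side in a short sketch. The file names are made up, and nullglob is enabled here as well so that a pattern with no matches yields an empty array. Note that extglob has to be switched on before the line containing the pattern is parsed, which is why shopt sits on its own line at the top:

```shell
#!/usr/bin/env bash
shopt -s extglob nullglob

# Throwaway directory with hypothetical file names.
tmp=$(mktemp -d)
touch "$tmp"/f1 "$tmp"/f2 "$tmp"/f10 "$tmp"/f007 "$tmp"/fx

any_digits=( "$tmp"/f+([0-9]) )            # one or more digits, leading zeros allowed
no_leading_zero=( "$tmp"/f[1-9]*([0-9]) )  # first digit 1-9, then zero or more digits

shopt -u extglob nullglob

# ##*/ strips the directory part from each element before printing.
printf 'any digits:      %s\n' "${any_digits[*]##*/}"
printf 'no leading zero: %s\n' "${no_leading_zero[*]##*/}"
rm -rf "$tmp"
```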
There's one other important thing to note here. If a glob pattern doesn't match any files, then it is interpreted as a raw string, and is left unmodified.
For example:
> echo This_file_totally_does_not_exist*
This_file_totally_does_not_exist*
Or more to the point in your case, suppose there are zero files in your 4th case, e.g. there are no files containing NpXynWT_com_340K. In this case, if you try to use a glob containing NpXynWT_com_340K, you get the entire glob as a literal string:
> shopt -s extglob
> echo NpXynWT_com_340K_[1-9]*([0-9])
NpXynWT_com_340K_[1-9]*([0-9])
This is obviously not what you want, especially in the middle of your script where you are trying to cat the matching files. Luckily there is another option you can set to make non-matching globs expand to nothing:
> shopt -s nullglob
> echo This_file_totally_does_not_exist* # prints nothing
As with extglob, there may be unintended behavior elsewhere in your script if you leave nullglob on.
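One pitfall worth guarding against when nullglob is on: if the glob matches nothing, cat receives no arguments at all and waits on standard input. A cautious sketch is to expand the glob into an array first and call cat only when the array is non-empty. The helper name merge_group and the file names below are assumptions made for the demonstration:

```shell
#!/usr/bin/env bash
# Throwaway directory with two hypothetical input files for one group.
tmp=$(mktemp -d)
printf 'a\n' > "$tmp/NpXynWT_apo_300K_1_FA.pdb"
printf 'b\n' > "$tmp/NpXynWT_apo_300K_2_FA.pdb"

# Hypothetical helper: merge_group <model> <output-file>
merge_group() {
  local model=$1 out=$2 parts
  shopt -s nullglob
  parts=( "$tmp/${model}"_[1-9]_FA.pdb )
  shopt -u nullglob              # keep nullglob confined to this one expansion
  if (( ${#parts[@]} == 0 )); then
    echo "no files for $model, skipping" >&2
    return 1
  fi
  cat "${parts[@]}" > "$out"
}

merge_group NpXynWT_apo_300K "$tmp/apo.pdb" && present=yes || present=no
merge_group NpXynWT_com_340K "$tmp/com.pdb" && missing=yes || missing=no
merged=$(cat "$tmp/apo.pdb")
rm -rf "$tmp"
```

Turning nullglob off again inside the helper also limits the side effects on the rest of the script.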

bash store output of command in array

I'm trying to find if the output of the following command, stores just one file in the array array_a
array_a = $(find /path/dir1 -maxdepth 1 -name file_orders?.csv)
echo $array_a
/path/dir1/file_orders1.csv /path/dir1/file_orders2.csv
echo ${#array_a[@]}
1
So it tells me there's just one element, but obviously there are 2.
If I type echo ${array_a[0]} it doesn't return anything. It's as if the variable array_a isn't an array at all. How can I force it to store the elements in an array?
You are lacking the parentheses which define an array. But the fundamental problem is that the unquoted output of find in a command substitution is split on whitespace, so if any matching file name contains a space, it will produce more than one element in the resulting array.
With -maxdepth 1 anyway, just use the shell's globbing facilities instead; you don't need find at all.
array_a=(/path/dir1/file_orders?.csv)
Also pay attention to quotes when using the array.
echo "${array_a[@]}"
Without the quotes, the whitespace splitting will happen again.
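The difference shows up as soon as a path contains a space. The sketch below uses a throwaway directory named "dir 1" (an assumption for the demonstration, as is the premise that the temporary directory path itself contains no whitespace) and compares the element counts of both approaches:

```shell
#!/usr/bin/env bash
# Throwaway directory whose name contains a space.
tmp=$(mktemp -d)
mkdir "$tmp/dir 1"
touch "$tmp/dir 1/file_orders1.csv" "$tmp/dir 1/file_orders2.csv"

# Unquoted command substitution: every path is split at the space in "dir 1".
naive=( $(find "$tmp/dir 1" -maxdepth 1 -name 'file_orders?.csv') )

# Glob directly into the array: exactly one element per file.
array_a=( "$tmp/dir 1"/file_orders?.csv )

echo "naive: ${#naive[@]} elements, glob: ${#array_a[@]} elements"
rm -rf "$tmp"
```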

What is the difference between bash arrays with the notation ${array[*]} and ${array[@]} [duplicate]

I'm taking a stab at writing a bash completion for the first time, and I'm a bit confused about the two ways of dereferencing bash arrays (${array[@]} and ${array[*]}).
Here's the relevant chunk of code (it works, but I would like to understand it better):
_switch()
{
local cur perls
local ROOT=${PERLBREW_ROOT:-$HOME/perl5/perlbrew}
COMPREPLY=()
cur=${COMP_WORDS[COMP_CWORD]}
perls=($ROOT/perls/perl-*)
# remove all but the final part of the name
perls=(${perls[*]##*/})
COMPREPLY=( $( compgen -W "${perls[*]} /usr/bin/perl" -- ${cur} ) )
}
bash's documentation says:
Any element of an array may be referenced using ${name[subscript]}. The braces are required to avoid conflicts with the shell's filename expansion operators. If the subscript is ‘@’ or ‘*’, the word expands to all members of the array name. These subscripts differ only when the word appears within double quotes. If the word is double-quoted, ${name[*]} expands to a single word with the value of each array member separated by the first character of the IFS variable, and ${name[@]} expands each element of name to a separate word.
Now I think I understand that compgen -W expects a string containing a wordlist of possible alternatives, but in this context I don't understand what "${name[@]} expands each element of name to a separate word" means.
Long story short: ${array[*]} works; ${array[@]} doesn't. I would like to know why, and I would like to understand better what exactly ${array[@]} expands into.
(This is an expansion of my comment on Kaleb Pederson's answer -- see that answer for a more general treatment of [@] vs [*].)
When bash (or any similar shell) parses a command line, it splits it into a series of "words" (which I will call "shell-words" to avoid confusion later). Generally, shell-words are separated by spaces (or other whitespace), but spaces can be included in a shell-word by escaping or quoting them. The difference between [@] and [*]-expanded arrays in double-quotes is that "${myarray[@]}" leads to each element of the array being treated as a separate shell-word, while "${myarray[*]}" results in a single shell-word with all of the elements of the array separated by spaces (or whatever the first character of IFS is).
Usually, the [@] behavior is what you want. Suppose we have perls=(perl-one perl-two) and use ls "${perls[*]}" -- that's equivalent to ls "perl-one perl-two", which will look for a single file named perl-one perl-two, which is probably not what you wanted. ls "${perls[@]}" is equivalent to ls "perl-one" "perl-two", which is much more likely to do something useful.
Providing a list of completion words (which I will call comp-words to avoid confusion with shell-words) to compgen is different; the -W option takes a list of comp-words, but it must be in the form of a single shell-word with the comp-words separated by spaces. Note that command options that take arguments always (at least as far as I know) take a single shell-word -- otherwise there'd be no way to tell when the arguments to the option end, and the regular command arguments (/other option flags) begin.
In more detail:
perls=(perl-one perl-two)
compgen -W "${perls[*]} /usr/bin/perl" -- ${cur}
is equivalent to:
compgen -W "perl-one perl-two /usr/bin/perl" -- ${cur}
...which does what you want. On the other hand,
perls=(perl-one perl-two)
compgen -W "${perls[@]} /usr/bin/perl" -- ${cur}
is equivalent to:
compgen -W "perl-one" "perl-two /usr/bin/perl" -- ${cur}
...which is complete nonsense: "perl-one" is the only comp-word attached to the -W flag, and the first real argument -- which compgen will take as the string to be completed -- is "perl-two /usr/bin/perl". I'd expect compgen to complain that it's been given extra arguments ("--" and whatever's in $cur), but apparently it just ignores them.
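The working form can be tried outside a completion function, since compgen is an ordinary bash builtin. In this sketch the value of cur is a hypothetical partial word, and the result is captured the same way the question's code fills COMPREPLY:

```shell
#!/usr/bin/env bash
perls=( perl-one perl-two )
cur="perl-t"   # hypothetical partial word typed by the user

# "${perls[*]} /usr/bin/perl" is one shell-word holding three comp-words.
matches=( $(compgen -W "${perls[*]} /usr/bin/perl" -- "$cur") )

printf '%s\n' "${matches[@]}"
```

Only perl-two starts with the given prefix, so it is the single candidate returned.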
Your title asks about ${array[@]} versus ${array[*]} (both within {}) but then you ask about $array[*] versus $array[@] (both without {}) which is a bit confusing. I'll answer both (within {}):
When you quote an array variable and use @ as a subscript, each element of the array is expanded to its full content regardless of whitespace (actually, one of $IFS) that may be present within that content. When you use the asterisk (*) as the subscript, the result depends on quoting: quoted, all elements are joined into a single word separated by the first character of $IFS; unquoted, it behaves like unquoted @ and each element's content is further broken up at $IFS.
Here's the example script:
#!/bin/bash
myarray[0]="one"
myarray[1]="two"
myarray[3]="three four"
echo "with quotes around myarray[*]"
for x in "${myarray[*]}"; do
echo "ARG[*]: '$x'"
done
echo "with quotes around myarray[@]"
for x in "${myarray[@]}"; do
echo "ARG[@]: '$x'"
done
echo "without quotes around myarray[*]"
for x in ${myarray[*]}; do
echo "ARG[*]: '$x'"
done
echo "without quotes around myarray[@]"
for x in ${myarray[@]}; do
echo "ARG[@]: '$x'"
done
And here's its output:
with quotes around myarray[*]
ARG[*]: 'one two three four'
with quotes around myarray[@]
ARG[@]: 'one'
ARG[@]: 'two'
ARG[@]: 'three four'
without quotes around myarray[*]
ARG[*]: 'one'
ARG[*]: 'two'
ARG[*]: 'three'
ARG[*]: 'four'
without quotes around myarray[@]
ARG[@]: 'one'
ARG[@]: 'two'
ARG[@]: 'three'
ARG[@]: 'four'
I personally usually want "${myarray[@]}". Now, to answer the second part of your question, ${array[@]} versus $array[@].
Quoting the bash docs, which you quoted:
The braces are required to avoid conflicts with the shell's filename expansion operators.
$ myarray=
$ myarray[0]="one"
$ myarray[1]="two"
$ echo ${myarray[@]}
one two
But, when you do $myarray[@], the dollar sign is tightly bound to myarray so it is evaluated before the [@]. For example:
$ ls $myarray[@]
ls: cannot access one[@]: No such file or directory
But, as noted in the documentation, the brackets are for filename expansion, so let's try this:
$ touch one@
$ ls $myarray[@]
one@
Now we can see that the filename expansion happened after the $myarray expansion.
And one more note, $myarray without a subscript expands to the first value of the array:
$ myarray[0]="one four"
$ echo $myarray[5]
one four[5]

Resources