Regular expression for GCC Pre Processor Line markers - c

Is there a Bash or Python regular expression that matches the C preprocessor line markers shown below?
The C preprocessor emits line markers like this:
# 74 "a/b/some_file.c" 3 4
First comes the symbol - #
space
then line number - 74
space
then - "file path"
space
then Zero or more integers separated by space - 3 4
More info on:
https://gcc.gnu.org/onlinedocs/gcc-6.4.0/cpp/Preprocessor-Output.html

I dunno where you'd want Bash to match such lines, and Python undoubtedly could do it with its regexes. I have a sed script which I use to convert #line directives into comments (which can sometimes make debugging preprocessed code easier). It looks like this:
#!/bin/sh
#
# @(#)$Id: linecomments.sh,v 1.3 2018/01/05 05:12:10 jleffler Exp $
#
# Convert #line directives into comments
# Deals with four forms of the #line directive:
# # line 99 "file"
# # line 99
# # 99 "file"
# # 99
exec sed \
-e 's%^[[:space:]]*#[[:space:]]*line[[:space:]][0-9][0-9]*[[:space:]].*%/*&*/%' \
-e 's%^[[:space:]]*#[[:space:]]*[0-9][0-9]*[[:space:]].*%/*&*/%' \
-e 's%^[[:space:]]*#[[:space:]]*line[[:space:]][0-9][0-9]*$%/*&*/%' \
-e 's%^[[:space:]]*#[[:space:]]*[0-9][0-9]*$%/*&*/%' \
"$@"
I use % to delimit the regular expressions since I need /* and */ in the replacement text. The regexes match 'trailing junk' such as the extra numbers emitted by GCC.
The edit in 2018 replaced fragments that looked like [  ] with [[:space:]] — the old code (dated 2001) used a blank and a tab in each case. This is clearer; you can copy'n'paste without having to worry about where there are tabs.
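For the Bash half of the question, here is a minimal sketch using the [[ =~ ]] operator. The pattern and the parse_linemarker helper are hypothetical names, not part of the script above. The pattern assumes the strict layout described in the question (single spaces, no escaped quotes inside the file name); real GCC output may need a looser pattern.

```bash
#!/usr/bin/env bash
# Hypothetical pattern: '#', space, line number, space, quoted file name,
# then zero or more space-separated integer flags.
linemarker_re='^# ([0-9]+) "([^"]*)"(( [0-9]+)*)$'

# Hypothetical helper: prints "<line>|<file>|<flags>" and returns 0 on a match,
# returns 1 (printing nothing) otherwise.
parse_linemarker() {
    local line=$1
    if [[ $line =~ $linemarker_re ]]; then
        # BASH_REMATCH[3] holds the flags with a leading space; strip it.
        printf '%s|%s|%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]# }"
        return 0
    fi
    return 1
}

parse_linemarker '# 74 "a/b/some_file.c" 3 4'   # prints 74|a/b/some_file.c|3 4
```

The regex is kept in a variable and used unquoted on the right-hand side of =~, which avoids having to escape the double quotes inside the pattern.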


Remove some arguments from argument string in zsh

I'm trying to remove part of an arguments string using zsh parameter expansion (no external tools like sed please). Here's what for:
The RUBYOPT environment variable contains arguments which are applied whenever the ruby interpreter is used, just as if they were given along with the ruby command. One argument controls the warning verbosity; possible settings are, for instance, -W0 or -W:no-deprecated. My goal is to remove all -W... arguments from RUBYOPT, say:
-W0 -X -> -X
-W:no-deprecated -X -W1 -> -X
My current approach is to split the string to an array and then make a substitution on every member of the array. This works on two lines of code, but I can't make it work on a single line of code:
% RUBYOPT="-W:no-deprecated -X -W1"
% parts=(${(@s: :)RUBYOPT})
% echo ${parts/-W*}
-X
% echo ${(${(@s: :)RUBYOPT})/-W*}
zsh: error in flags
What am I doing wrong here... or is there a different, more elegant way to achieve this?
Thanks for your hints!
${(... introduces parameter expansion flags (for example: ${(s: :)...}).
In ${(${(@s: :..., zsh therefore tries to read the inner ${(@s... part as a list of parameter expansion flags rather than as a nested expansion, so it yields the error "zsh: error in flags".
% RUBYOPT="-W:no-deprecated -X -W1"
% print -- ${${(s: :)RUBYOPT}/-W*}
# -X
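works instead, because the nested expansion is no longer preceded by an opening parenthesis.
Update from rowboat's comments: the unanchored pattern -W* also matches in the middle of a word, so it truncates flags like -abc-Whoops or -foo-Whoo: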
% RUBYOPT="-W:no-deprecated -X -W1 -foo-Whoo"
% parts=(${(s: :)RUBYOPT})
% print -- ${parts/-W*}
# -X -foo
# Note: -foo would be unexpected
% print -- ${${(s: :)RUBYOPT}/-W*}
# -X -foo
# Note: -foo would be unexpected
The (#s) globbing flag (which requires the shell option EXTENDED_GLOB) anchors the pattern to the start of each element and avoids this:
% RUBYOPT="-W:no-deprecated -X -W1 -foo-Whoo"
% parts=(${(s: :)RUBYOPT})
% setopt extendedglob
# To use `(#s)` flag which is like regex's `^`
% print -- ${parts/(#s)-W*}
# -X -foo-Whoo
% print -- ${${(s: :)RUBYOPT}/(#s)-W*}
# -X -foo-Whoo
Globbing Flags
There are various flags which affect any text to their right up to the end of the enclosing group or to the end of the pattern; they require the EXTENDED_GLOB option. All take the form (#X) where X may have one of the following forms:
...
s, e
Unlike the other flags, these have only a local effect, and each must appear on its own: (#s) and (#e) are the only valid forms. The (#s) flag succeeds only at the start of the test string, and the (#e) flag succeeds only at the end of the test string; they correspond to ^ and $ in standard regular expressions.
...
--- zshexpn(1), Expansion, Globbing Flags
Alternatively, the ${name:#pattern} syntax described below also avoids the problem.
end update from rowboat's comments
Another option is the typeset -T feature, which ties the scalar to an array so that the value can be manipulated with array operators.
RUBYOPT="-W:no-deprecated -X -W1"
typeset -xT RUBYOPT rubyopt ' '
rubyopt=(${rubyopt:#-W*})
print -l -- "$RUBYOPT"
# -X
typeset
...
-T [ SCALAR[=VALUE] ARRAY[=(VALUE ...)] [ SEP ] ]
...
the -T option requires zero, two, or three arguments to be present. With no arguments, the list of parameters created in this fashion is shown. With two or three arguments, the first two are the name of a scalar and of an array parameter (in that order) that will be tied together in the manner of $PATH and $path. The optional third argument is a single-character separator which will be used to join the elements of the array to form the scalar; if absent, a colon is used, as with $PATH. Only the first character of the separator is significant; any remaining characters are ignored. Multibyte characters are not yet supported.
...
Both the scalar and the array may be manipulated as normal. If one is unset, the other will automatically be unset too.
...
--- zshbuiltin(1), Shell Builtin Commands, typeset
The assignment rubyopt=(${rubyopt:#-W*}) then filters the array elements:
${name:#pattern}
If the pattern matches the value of name, then substitute the empty string; otherwise, just substitute the value of name. If name is an array the matching array elements are removed (use the (M) flag to remove the non-matched elements).
--- zshexpn(1), Parameter Expansion, ${name:#pattern}
Note: The "@" flag can be omitted here because the empty values do not need to be preserved in this case.
RUBYOPT="-W:no-deprecated -X -W1"
parts=(${(s: :)RUBYOPT})
print -- ${parts/-W*}
# -X
print -- ${${(s: :)RUBYOPT}/-W*}
# -X
Parameter Expansion Flags
...
@
In double quotes, array elements are put into separate words. E.g., "${(@)foo}" is equivalent to "${foo[@]}" and "${(@)foo[1,2]}" is the same as "$foo[1]" "$foo[2]". This is distinct from field splitting by the f, s or z flags, which still applies within each array element.
--- zshexpn(1), Parameter Expansion Flags, @
If the empty values must be preserved, the ${name:#pattern} syntax works:
RUBYOPT="-W:no-deprecated  -X -W1"
parts=("${(@s: :)RUBYOPT}")
# parts=("-W:no-deprecated" "" "-X" "-W1")
# Note that the empty value is retained
print -rC1 -- "${(@qqq)parts:#-W*}"
# ""
# "-X"
print -rC1 -- "${(@qqq)${(@s: :)RUBYOPT}:#-W*}"
# ""
# "-X"

call program with arguments from an array containing items from another array wrapped in double quotes

(This is a more specific version of the problem discussed in bash - expand arguments from array containing double quotes.)
I want bash to call cmake with arguments from an array with double quotes which itself contain items from another array. Here is an example for clarification:
cxx_flags=()
cxx_flags+=(-fdiagnostics-color)
cxx_flags+=(-O3)
cmake_arguments=()
cmake_arguments+=(-DCMAKE_BUILD_TYPE=Release)
cmake_arguments+=("-DCMAKE_CXX_FLAGS=\"${cxx_flags[@]}\"")
The arguments should be pretty-printed like this:
$ echo "CMake arguments: ${cmake_arguments[@]}"
CMake arguments: -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-fdiagnostics-color -O3"
Problem
And finally cmake should be called (this does not work!):
cmake .. "${cmake_arguments[@]}"
It expands to (as set -x produces):
cmake .. -DCMAKE_BUILD_TYPE=Release '-DCMAKE_CXX_FLAGS="-fdiagnostics-color' '-O3"'
Workaround
echo "cmake .. ${cmake_arguments[@]}" | source /dev/stdin
Expands to:
cmake .. -DCMAKE_BUILD_TYPE=Release '-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3'
That's okay but it seems like a hack. Is there a better solution?
Update
If you want to iterate over the array you should use one more variable (as randomir and Jeff Breadner suggested):
cxx_flags=()
cxx_flags+=(-fdiagnostics-color)
cxx_flags+=(-O3)
cxx_flags_string="${cxx_flags[@]}"
cmake_arguments=()
cmake_arguments+=(-DCMAKE_BUILD_TYPE=Release)
cmake_arguments+=("-DCMAKE_CXX_FLAGS=\"$cxx_flags_string\"")
The core problem remains (and the workaround still works) but you could iterate over cmake_arguments and see two items (as intended) instead of three (-DCMAKE_BUILD_TYPE=Release, -DCMAKE_CXX_FLAGS="-fdiagnostics-color and -O3"):
echo "cmake .. \\"
size=${#cmake_arguments[@]}
for ((i = 0; i < $size; ++i)); do
if [[ $(($i + 1)) -eq $size ]]; then
echo " ${cmake_arguments[$i]}"
else
echo " ${cmake_arguments[$i]} \\"
fi
done
Prints:
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS="-fdiagnostics-color -O3"
It seems that there's another layer of parsing that has to happen before cmake is happy; the | source /dev/stdin handles this, but you could also just move your CXX flags through an additional variable:
#!/bin/bash -x
cxx_flags=()
cxx_flags+=(-fdiagnostics-color)
cxx_flags+=(-O3)
CXX_FLAGS="${cxx_flags[@]}"
cmake_arguments=()
cmake_arguments+=(-DCMAKE_BUILD_TYPE=Release)
cmake_arguments+=("'-DCMAKE_CXX_FLAGS=${CXX_FLAGS}'")
CMAKE_ARGUMENTS="${cmake_arguments[@]}"
echo "CMake arguments: ${CMAKE_ARGUMENTS}"
returns:
+ cxx_flags=()
+ cxx_flags+=(-fdiagnostics-color)
+ cxx_flags+=(-O3)
+ CXX_FLAGS='-fdiagnostics-color -O3'
+ cmake_arguments=()
+ cmake_arguments+=(-DCMAKE_BUILD_TYPE=Release)
+ cmake_arguments+=("'-DCMAKE_CXX_FLAGS=${CXX_FLAGS}'")
+ CMAKE_ARGUMENTS='-DCMAKE_BUILD_TYPE=Release '\''-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3'\'''
+ echo 'CMake arguments: -DCMAKE_BUILD_TYPE=Release '\''-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3'\'''
CMake arguments: -DCMAKE_BUILD_TYPE=Release '-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3'
There is probably a cleaner solution still, but this is better than the | source /dev/stdin thing, I think.
You basically want cxx_flags array expanded into a single word.
This:
cxx_flags=()
cxx_flags+=(-fdiagnostics-color)
cxx_flags+=(-O3)
flags="${cxx_flags[@]}"
cmake_arguments=()
cmake_arguments+=(-DCMAKE_BUILD_TYPE=Release)
cmake_arguments+=(-DCMAKE_CXX_FLAGS="$flags")
will produce the output you want:
$ set -x
$ echo "${cmake_arguments[@]}"
+ echo -DCMAKE_BUILD_TYPE=Release '-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3'
So, to summarize, running:
cmake .. "${cmake_arguments[@]}"
with array expansion quoted, ensures each array element (cmake argument) is expanded as only one word (if it contains spaces, the shell won't print quotes around it, but the command executed will receive the whole string as a single argument). You can verify that with set -x.
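As a way to see this without set -x, here is a small sketch. The show_args function is a hypothetical stand-in for cmake that just reports what it receives, and the array names broken and fixed are made up for illustration. It also shows why the original attempt failed: "${cxx_flags[@]}" still expands to one word per element when embedded in a longer double-quoted string, and the escaped quotes are only literal characters.

```bash
#!/usr/bin/env bash
# show_args is a hypothetical stand-in for cmake: it reports the arguments it receives.
show_args() {
    printf 'argc=%d\n' "$#"
    local arg
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

cxx_flags=(-fdiagnostics-color -O3)

# Embedding [@] in a longer string still yields one word per element:
# the prefix sticks to the first element, the suffix to the last.
broken=("-DCMAKE_CXX_FLAGS=\"${cxx_flags[@]}\"")
show_args "${broken[@]}"
# argc=2
# [-DCMAKE_CXX_FLAGS="-fdiagnostics-color]
# [-O3"]

# Joining first ([*] joins with the first character of IFS, a space by default)
# gives a single word, and no literal quotes are needed.
flags="${cxx_flags[*]}"
fixed=("-DCMAKE_CXX_FLAGS=$flags")
show_args "${fixed[@]}"
# argc=1
# [-DCMAKE_CXX_FLAGS=-fdiagnostics-color -O3]
```

Using [*] for the join is a choice made in this sketch; the scalar assignment with [@] shown above produces the same string in bash.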
If you need to print the complete command with arguments in a way that can be reused by copy/pasting, you can consider using printf with %q format specifier, which will quote the argument in a way that can be reused as shell input:
$ printf "cmake .. "; printf "%q " "${cmake_arguments[@]}"; printf "\n"
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=-fdiagnostics-color\ -O3
Note the backslash which escapes the space.

Split string into an array and count elements in Bash [duplicate]

In a Bash script, I would like to split a line into pieces and store them in an array.
For example, given the line:
Paris, France, Europe
I would like to have the resulting array to look like so:
array[0] = Paris
array[1] = France
array[2] = Europe
A simple implementation is preferable; speed does not matter. How can I do it?
IFS=', ' read -r -a array <<< "$string"
Note that the characters in $IFS are treated individually as separators so that in this case fields may be separated by either a comma or a space rather than the sequence of the two characters. Interestingly though, empty fields aren't created when comma-space appears in the input because the space is treated specially.
To access an individual element:
echo "${array[0]}"
To iterate over the elements:
for element in "${array[@]}"
do
echo "$element"
done
To get both the index and the value:
for index in "${!array[@]}"
do
echo "$index ${array[index]}"
done
The last example is useful because Bash arrays are sparse. In other words, you can delete an element or add an element and then the indices are not contiguous.
unset "array[1]"
array[42]=Earth
To get the number of elements in an array:
echo "${#array[@]}"
As mentioned above, arrays can be sparse so you shouldn't use the length to get the last element. Here's how you can in Bash 4.2 and later:
echo "${array[-1]}"
Or, in any version of Bash (from somewhere after 2.05b):
echo "${array[@]: -1:1}"
Larger negative offsets select farther from the end of the array. Note the space before the minus sign in the older form. It is required.
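The sparse-array behaviour described above can be seen in a short sketch that reuses the example values from this answer:

```bash
#!/usr/bin/env bash
array=(Paris France Europe)
unset 'array[1]'      # remove the middle element; indices 0 and 2 remain
array[42]=Earth       # add an element far past the end

echo "${#array[@]}"        # 3 (the number of elements, not highest index + 1)
echo "${!array[@]}"        # 0 2 42
echo "${array[@]: -1:1}"   # Earth
```

Because the count is 3 while the highest index is 42, something like ${array[${#array[@]}-1]} would look at index 2 here, not at the last element.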
All of the answers to this question are wrong in one way or another.
Wrong answer #1
IFS=', ' read -r -a array <<< "$string"
1: This is a misuse of $IFS. The value of the $IFS variable is not taken as a single variable-length string separator, rather it is taken as a set of single-character string separators, where each field that read splits off from the input line can be terminated by any character in the set (comma or space, in this example).
Actually, for the real sticklers out there, the full meaning of $IFS is slightly more involved. From the bash manual:
The shell treats each character of IFS as a delimiter, and splits the results of the other expansions into words using these characters as field terminators. If IFS is unset, or its value is exactly <space><tab><newline>, the default, then sequences of <space>, <tab>, and <newline> at the beginning and end of the results of the previous expansions are ignored, and any sequence of IFS characters not at the beginning or end serves to delimit words. If IFS has a value other than the default, then sequences of the whitespace characters <space>, <tab>, and <newline> are ignored at the beginning and end of the word, as long as the whitespace character is in the value of IFS (an IFS whitespace character). Any character in IFS that is not IFS whitespace, along with any adjacent IFS whitespace characters, delimits a field. A sequence of IFS whitespace characters is also treated as a delimiter. If the value of IFS is null, no word splitting occurs.
Basically, for non-default non-null values of $IFS, fields can be separated with either (1) a sequence of one or more characters that are all from the set of "IFS whitespace characters" (that is, whichever of <space>, <tab>, and <newline> ("newline" meaning line feed (LF)) are present anywhere in $IFS), or (2) any non-"IFS whitespace character" that's present in $IFS along with whatever "IFS whitespace characters" surround it in the input line.
For the OP, it's possible that the second separation mode I described in the previous paragraph is exactly what he wants for his input string, but we can be pretty confident that the first separation mode I described is not correct at all. For example, what if his input string was 'Los Angeles, United States, North America'?
IFS=', ' read -ra a <<<'Los Angeles, United States, North America'; declare -p a;
## declare -a a=([0]="Los" [1]="Angeles" [2]="United" [3]="States" [4]="North" [5]="America")
2: Even if you were to use this solution with a single-character separator (such as a comma by itself, that is, with no following space or other baggage), if the value of the $string variable happens to contain any LFs, then read will stop processing once it encounters the first LF. The read builtin only processes one line per invocation. This is true even if you are piping or redirecting input only to the read statement, as we are doing in this example with the here-string mechanism, and thus unprocessed input is guaranteed to be lost. The code that powers the read builtin has no knowledge of the data flow within its containing command structure.
You could argue that this is unlikely to cause a problem, but still, it's a subtle hazard that should be avoided if possible. It is caused by the fact that the read builtin actually does two levels of input splitting: first into lines, then into fields. Since the OP only wants one level of splitting, this usage of the read builtin is not appropriate, and we should avoid it.
3: A non-obvious potential issue with this solution is that read always drops the trailing field if it is empty, although it preserves empty fields otherwise. Here's a demo:
string=', , a, , b, c, , , '; IFS=', ' read -ra a <<<"$string"; declare -p a;
## declare -a a=([0]="" [1]="" [2]="a" [3]="" [4]="b" [5]="c" [6]="" [7]="")
Maybe the OP wouldn't care about this, but it's still a limitation worth knowing about. It reduces the robustness and generality of the solution.
This problem can be solved by appending a dummy trailing delimiter to the input string just prior to feeding it to read, as I will demonstrate later.
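As a quick illustration of the idea with a single-character separator, here is a minimal sketch. The variable names plain and padded are made up for this example, and a lone comma is used as the separator to keep the behaviour easy to follow.

```bash
#!/usr/bin/env bash
string='a,b,'                        # the last field is empty

IFS=, read -ra plain  <<<"$string"   # trailing empty field is dropped
IFS=, read -ra padded <<<"$string,"  # dummy delimiter appended first

echo "${#plain[@]}"    # 2
echo "${#padded[@]}"   # 3
declare -p padded
```

The extra comma turns the empty last field into a terminated field, so read keeps it and drops only the new, artificial trailing field.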
Wrong answer #2
string="1:2:3:4:5"
set -f # avoid globbing (expansion of *).
array=(${string//:/ })
Similar idea:
t="one,two,three"
a=($(echo $t | tr ',' "\n"))
(Note: I added the missing parentheses around the command substitution which the answerer seems to have omitted.)
Similar idea:
string="1,2,3,4"
array=(`echo $string | sed 's/,/\n/g'`)
These solutions leverage word splitting in an array assignment to split the string into fields. Funnily enough, just like read, general word splitting also uses the $IFS special variable, although in this case it is implied that it is set to its default value of <space><tab><newline>, and therefore any sequence of one or more IFS characters (which are all whitespace characters now) is considered to be a field delimiter.
This solves the problem of two levels of splitting committed by read, since word splitting by itself constitutes only one level of splitting. But just as before, the problem here is that the individual fields in the input string can already contain $IFS characters, and thus they would be improperly split during the word splitting operation. This happens to not be the case for any of the sample input strings provided by these answerers (how convenient...), but of course that doesn't change the fact that any code base that used this idiom would then run the risk of blowing up if this assumption were ever violated at some point down the line. Once again, consider my counterexample of 'Los Angeles, United States, North America' (or 'Los Angeles:United States:North America').
Also, word splitting is normally followed by filename expansion (aka pathname expansion aka globbing), which, if done, would potentially corrupt words containing the characters *, ?, or [ followed by ] (and, if extglob is set, parenthesized fragments preceded by ?, *, +, @, or !) by matching them against file system objects and expanding the words ("globs") accordingly. The first of these three answerers has cleverly undercut this problem by running set -f beforehand to disable globbing. Technically this works (although you should probably add set +f afterward to reenable globbing for subsequent code which may depend on it), but it's undesirable to have to mess with global shell settings in order to hack a basic string-to-array parsing operation in local code.
Another issue with this answer is that all empty fields will be lost. This may or may not be a problem, depending on the application.
Note: If you're going to use this solution, it's better to use the ${string//:/ } "pattern substitution" form of parameter expansion, rather than going to the trouble of invoking a command substitution (which forks the shell), starting up a pipeline, and running an external executable (tr or sed), since parameter expansion is purely a shell-internal operation. (Also, for the tr and sed solutions, the input variable should be double-quoted inside the command substitution; otherwise word splitting would take effect in the echo command and potentially mess with the field values. Also, the $(...) form of command substitution is preferable to the old `...` form since it simplifies nesting of command substitutions and allows for better syntax highlighting by text editors.)
Wrong answer #3
str="a, b, c, d" # assuming there is a space after ',' as in Q
arr=(${str//,/}) # delete all occurrences of ','
This answer is almost the same as #2. The difference is that the answerer has made the assumption that the fields are delimited by two characters, one of which being represented in the default $IFS, and the other not. He has solved this rather specific case by removing the non-IFS-represented character using a pattern substitution expansion and then using word splitting to split the fields on the surviving IFS-represented delimiter character.
This is not a very generic solution. Furthermore, it can be argued that the comma is really the "primary" delimiter character here, and that stripping it and then depending on the space character for field splitting is simply wrong. Once again, consider my counterexample: 'Los Angeles, United States, North America'.
Also, again, filename expansion could corrupt the expanded words, but this can be prevented by temporarily disabling globbing for the assignment with set -f and then set +f.
Also, again, all empty fields will be lost, which may or may not be a problem depending on the application.
Wrong answer #4
string='first line
second line
third line'
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"
This is similar to #2 and #3 in that it uses word splitting to get the job done, only now the code explicitly sets $IFS to contain only the single-character field delimiter present in the input string. It should be repeated that this cannot work for multicharacter field delimiters such as the OP's comma-space delimiter. But for a single-character delimiter like the LF used in this example, it actually comes close to being perfect. The fields cannot be unintentionally split in the middle as we saw with previous wrong answers, and there is only one level of splitting, as required.
One problem is that filename expansion will corrupt affected words as described earlier, although once again this can be solved by wrapping the critical statement in set -f and set +f.
Another potential problem is that, since LF qualifies as an "IFS whitespace character" as defined earlier, all empty fields will be lost, just as in #2 and #3. This would of course not be a problem if the delimiter happens to be a non-"IFS whitespace character", and depending on the application it may not matter anyway, but it does vitiate the generality of the solution.
So, to sum up, assuming you have a one-character delimiter, and it is either a non-"IFS whitespace character" or you don't care about empty fields, and you wrap the critical statement in set -f and set +f, then this solution works, but otherwise not.
(Also, for information's sake, assigning a LF to a variable in bash can be done more easily with the $'...' syntax, e.g. IFS=$'\n';.)
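Putting those conditions together, a minimal sketch of this approach might look as follows. The split_lines function and the global lines array are hypothetical names, and the unconditional set +f assumes globbing was enabled before the call.

```bash
#!/usr/bin/env bash
string=$'first line\nsecond *\nthird line'

# split_lines is a hypothetical helper; it fills the global array "lines".
split_lines() {
    local IFS=$'\n'   # local, so the caller's IFS is untouched
    set -f            # disable globbing so the * is not expanded
    lines=( $1 )      # unquoted on purpose: word splitting does the work
    set +f            # assumes globbing was enabled beforehand
}

split_lines "$string"
declare -p lines
```

Declaring IFS local inside a function is one way to avoid the save-and-restore dance with oldIFS; as noted above, empty lines are still lost because LF is an IFS whitespace character.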
Wrong answer #5
countries='Paris, France, Europe'
OIFS="$IFS"
IFS=', ' array=($countries)
IFS="$OIFS"
Similar idea:
IFS=', ' eval 'array=($string)'
This solution is effectively a cross between #1 (in that it sets $IFS to comma-space) and #2-4 (in that it uses word splitting to split the string into fields). Because of this, it suffers from most of the problems that afflict all of the above wrong answers, sort of like the worst of all worlds.
Also, regarding the second variant, it may seem like the eval call is completely unnecessary, since its argument is a single-quoted string literal, and therefore is statically known. But there's actually a very non-obvious benefit to using eval in this way. Normally, when you run a simple command which consists of a variable assignment only, meaning without an actual command word following it, the assignment takes effect in the shell environment:
IFS=', '; ## changes $IFS in the shell environment
This is true even if the simple command involves multiple variable assignments; again, as long as there's no command word, all variable assignments affect the shell environment:
IFS=', ' array=($countries); ## changes both $IFS and $array in the shell environment
But, if the variable assignment is attached to a command name (I like to call this a "prefix assignment") then it does not affect the shell environment, and instead only affects the environment of the executed command, regardless whether it is a builtin or external:
IFS=', ' :; ## : is a builtin command, the $IFS assignment does not outlive it
IFS=', ' env; ## env is an external command, the $IFS assignment does not outlive it
Relevant quote from the bash manual:
If no command name results, the variable assignments affect the current shell environment. Otherwise, the variables are added to the environment of the executed command and do not affect the current shell environment.
It is possible to exploit this feature of variable assignment to change $IFS only temporarily, which allows us to avoid the whole save-and-restore gambit like that which is being done with the $OIFS variable in the first variant. But the challenge we face here is that the command we need to run is itself a mere variable assignment, and hence it would not involve a command word to make the $IFS assignment temporary. You might think to yourself, well why not just add a no-op command word to the statement like the : builtin to make the $IFS assignment temporary? This does not work because it would then make the $array assignment temporary as well:
IFS=', ' array=($countries) :; ## fails; new $array value never escapes the : command
So, we're effectively at an impasse, a bit of a catch-22. But, when eval runs its code, it runs it in the shell environment, as if it was normal, static source code, and therefore we can run the $array assignment inside the eval argument to have it take effect in the shell environment, while the $IFS prefix assignment that is prefixed to the eval command will not outlive the eval command. This is exactly the trick that is being used in the second variant of this solution:
IFS=', ' eval 'array=($string)'; ## $IFS does not outlive the eval command, but $array does
So, as you can see, it's actually quite a clever trick, and accomplishes exactly what is required (at least with respect to assignment effectation) in a rather non-obvious way. I'm actually not against this trick in general, despite the involvement of eval; just be careful to single-quote the argument string to guard against security threats.
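To check the scoping for yourself, a small sketch like the following can help. The saved_ifs variable exists only for the comparison and is not part of the trick. The description above applies to bash in its default mode; in POSIX mode, prefix assignments to special builtins such as eval persist, so the comparison may report a different result there.

```bash
#!/usr/bin/env bash
string='Paris, France, Europe'
saved_ifs=$IFS            # only kept for the comparison below

IFS=', ' eval 'array=($string)'

declare -p array
# Compare IFS with its value before the eval line.
if [[ $IFS == "$saved_ifs" ]]; then
    echo 'IFS unchanged'
else
    echo 'IFS changed'
fi
```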
But again, because of the "worst of all worlds" agglomeration of problems, this is still a wrong answer to the OP's requirement.
Wrong answer #6
IFS=', '; array=(Paris, France, Europe)
IFS=' ';declare -a array=(Paris France Europe)
Um... what? The OP has a string variable that needs to be parsed into an array. This "answer" starts with the verbatim contents of the input string pasted into an array literal. I guess that's one way to do it.
It looks like the answerer may have assumed that the $IFS variable affects all bash parsing in all contexts, which is not true. From the bash manual:
IFS The Internal Field Separator that is used for word splitting after expansion and to split lines into words with the read builtin command. The default value is <space><tab><newline>.
So the $IFS special variable is actually only used in two contexts: (1) word splitting that is performed after expansion (meaning not when parsing bash source code) and (2) for splitting input lines into words by the read builtin.
Let me try to make this clearer. I think it might be good to draw a distinction between parsing and execution. Bash must first parse the source code, which obviously is a parsing event, and then later it executes the code, which is when expansion comes into the picture. Expansion is really an execution event. Furthermore, I take issue with the description of the $IFS variable that I just quoted above; rather than saying that word splitting is performed after expansion, I would say that word splitting is performed during expansion, or, perhaps even more precisely, word splitting is part of the expansion process. The phrase "word splitting" refers only to this step of expansion; it should never be used to refer to the parsing of bash source code, although unfortunately the docs do seem to throw around the words "split" and "words" a lot. Here's a relevant excerpt from the linux.die.net version of the bash manual:
Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion.
The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.
You could argue the GNU version of the manual does slightly better, since it opts for the word "tokens" instead of "words" in the first sentence of the Expansion section:
Expansion is performed on the command line after it has been split into tokens.
The important point is, $IFS does not change the way bash parses source code. Parsing of bash source code is actually a very complex process that involves recognition of the various elements of shell grammar, such as command sequences, command lists, pipelines, parameter expansions, arithmetic substitutions, and command substitutions. For the most part, the bash parsing process cannot be altered by user-level actions like variable assignments (actually, there are some minor exceptions to this rule; for example, see the various compatxx shell settings, which can change certain aspects of parsing behavior on-the-fly). The upstream "words"/"tokens" that result from this complex parsing process are then expanded according to the general process of "expansion" as broken down in the above documentation excerpts, where word splitting of the expanded (expanding?) text into downstream words is simply one step of that process. Word splitting only touches text that has been spit out of a preceding expansion step; it does not affect literal text that was parsed right off the source bytestream.
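A short sketch makes the point concrete. The demo_literal function is a hypothetical wrapper whose only job is to keep the IFS change local. The words inside the array literal come straight from the source text, so IFS never gets a chance to split them and the commas stay attached.

```bash
#!/usr/bin/env bash
# demo_literal is a hypothetical helper; "local" keeps the IFS change contained.
demo_literal() {
    local IFS=', '
    # Literal words: separated by the shell grammar (whitespace), not by IFS.
    array=(Paris, France, Europe)
}

demo_literal
printf '[%s]\n' "${array[@]}"
# [Paris,]
# [France,]
# [Europe]
```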
Wrong answer #7
string='first line
second line
third line'
while read -r line; do lines+=("$line"); done <<<"$string"
This is one of the best solutions. Notice that we're back to using read. Didn't I say earlier that read is inappropriate because it performs two levels of splitting, when we only need one? The trick here is that you can call read in such a way that it effectively only does one level of splitting, specifically by splitting off only one field per invocation, which necessitates the cost of having to call it repeatedly in a loop. It's a bit of a sleight of hand, but it works.
But there are problems. First: When you provide at least one NAME argument to read, it automatically ignores leading and trailing whitespace in each field that is split off from the input string. This occurs whether $IFS is set to its default value or not, as described earlier in this post. Now, the OP may not care about this for his specific use-case, and in fact, it may be a desirable feature of the parsing behavior. But not everyone who wants to parse a string into fields will want this. There is a solution, however: A somewhat non-obvious usage of read is to pass zero NAME arguments. In this case, read will store the entire input line that it gets from the input stream in a variable named $REPLY, and, as a bonus, it does not strip leading and trailing whitespace from the value. This is a very robust usage of read which I've exploited frequently in my shell programming career. Here's a demonstration of the difference in behavior:
string=$' a b \n c d \n e f '; ## input string
a=(); while read -r line; do a+=("$line"); done <<<"$string"; declare -p a;
## declare -a a=([0]="a b" [1]="c d" [2]="e f") ## read trimmed surrounding whitespace
a=(); while read -r; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]=" a b " [1]=" c d " [2]=" e f ") ## no trimming
The second issue with this solution is that it does not actually address the case of a custom field separator, such as the OP's comma-space. As before, multicharacter separators are not supported, which is an unfortunate limitation of this solution. We could try to at least split on comma by specifying the separator to the -d option, but look what happens:
string='Paris, France, Europe';
a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France")
Predictably, the unaccounted surrounding whitespace got pulled into the field values, and hence this would have to be corrected subsequently through trimming operations (this could also be done directly in the while-loop). But there's another obvious error: Europe is missing! What happened to it? The answer is that read returns a failing return code if it hits end-of-file (in this case we can call it end-of-string) without encountering a final field terminator on the final field. This causes the while-loop to break prematurely and we lose the final field.
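The failing return code can be observed directly. This is a small sketch (the variable names first, last, rc_first and rc_last are mine) that feeds read through printf rather than a here-string, so that no LF is appended to the input:

```bash
# printf is used instead of a here-string so that no trailing LF is appended.
read -rd, first < <(printf '%s' 'Paris,France'); rc_first=$?
read -rd, last  < <(printf '%s' 'Europe');       rc_last=$?
echo "first=$first rc=$rc_first"   # delimiter found: status 0
echo "last=$last rc=$rc_last"      # EOF before any delimiter: non-zero status
```

Note that the variable is still assigned in the second call; it is only the exit status that makes a plain while-loop stop before the body can store the value.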
Technically this same error afflicted the previous examples as well; the difference there is that the field separator was taken to be LF, which is the default when you don't specify the -d option, and the <<< ("here-string") mechanism automatically appends a LF to the string just before it feeds it as input to the command. Hence, in those cases, we sort of accidentally solved the problem of a dropped final field by unwittingly appending an additional dummy terminator to the input. Let's call this solution the "dummy-terminator" solution. We can apply the dummy-terminator solution manually for any custom delimiter by concatenating it against the input string ourselves when instantiating it in the here-string:
a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")
There, problem solved. Another solution is to only break the while-loop if both (1) read returned failure and (2) $REPLY is empty, meaning read was not able to read any characters prior to hitting end-of-file. Demo:
a=(); while read -rd,|| [[ -n "$REPLY" ]]; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')
This approach also reveals the secretive LF that automatically gets appended to the here-string by the <<< redirection operator. It could of course be stripped off separately through an explicit trimming operation as described a moment ago, but obviously the manual dummy-terminator approach solves it directly, so we could just go with that. The manual dummy-terminator solution is actually quite convenient in that it solves both of these two problems (the dropped-final-field problem and the appended-LF problem) in one go.
So, overall, this is quite a powerful solution. Its only remaining weakness is a lack of support for multicharacter delimiters, which I will address later.
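For completeness, here is one possible way to fold the whitespace trimming mentioned above into the while-loop itself, combined with the manual dummy-terminator. It is only a sketch: the field variable and the parameter-expansion trimming idiom are my choice, not something prescribed by the answers being reviewed.

```bash
string='Paris, France, Europe'
a=()
while read -rd,; do
    field=$REPLY
    field="${field#"${field%%[![:space:]]*}"}"   # strip leading whitespace
    field="${field%"${field##*[![:space:]]}"}"   # strip trailing whitespace
    a+=("$field")
done <<<"$string,"
declare -p a
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
```

The final read sees only the LF appended by the here-string, hits end-of-file, and fails, so the loop ends without storing a stray element.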
Wrong answer #8
string='first line
second line
third line'
readarray -t lines <<<"$string"
(This is actually from the same post as #7; the answerer provided two solutions in the same post.)
The readarray builtin, which is a synonym for mapfile, is ideal. It's a builtin command which parses a bytestream into an array variable in one shot; no messing with loops, conditionals, substitutions, or anything else. And it doesn't surreptitiously strip any whitespace from the input string. And (if -O is not given) it conveniently clears the target array before assigning to it. But it's still not perfect, hence my criticism of it as a "wrong answer".
First, just to get this out of the way, note that, just like the behavior of read when doing field-parsing, readarray drops the trailing field if it is empty. Again, this is probably not a concern for the OP, but it could be for some use-cases. I'll come back to this in a moment.
Second, as before, it does not support multicharacter delimiters. I'll give a fix for this in a moment as well.
Third, the solution as written does not parse the OP's input string, and in fact, it cannot be used as-is to parse it. I'll expand on this momentarily as well.
For the above reasons, I still consider this to be a "wrong answer" to the OP's question. Below I'll give what I consider to be the right answer.
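The first point, about an empty trailing field being dropped, can be illustrated with a short sketch. It assumes bash 4.4 or later (for the -d option of readarray); the array names are made up for the example:

```bash
# Requires bash 4.4+ for readarray -d. Array names are illustrative.
readarray -td, with_trailing < <(printf '%s' 'a,b,')   # input ends with the delimiter
readarray -td, with_middle   < <(printf '%s' 'a,,b')   # empty field in the middle
declare -p with_trailing with_middle
## declare -a with_trailing=([0]="a" [1]="b")
## declare -a with_middle=([0]="a" [1]="" [2]="b")
```

An empty field in the middle survives, but nothing follows the last delimiter in the first input, so no third element is created there.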
Right answer
Here's a naïve attempt to make #8 work by just specifying the -d option:
string='Paris, France, Europe';
readarray -td, a <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')
We see the result is identical to the result we got from the double-conditional approach of the looping read solution discussed in #7. We can almost solve this with the manual dummy-terminator trick:
readarray -td, a <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe" [3]=$'\n')
The problem here is that readarray preserved the trailing field, since the <<< redirection operator appended the LF to the input string, and therefore the trailing field was not empty (otherwise it would've been dropped). We can take care of this by explicitly unsetting the final array element after-the-fact:
readarray -td, a <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")
The only two problems that remain, which are actually related, are (1) the extraneous whitespace that needs to be trimmed, and (2) the lack of support for multicharacter delimiters.
The whitespace could of course be trimmed afterward (for example, see How to trim whitespace from a Bash variable?). But if we can hack a multicharacter delimiter, then that would solve both problems in one shot.
Unfortunately, there's no direct way to get a multicharacter delimiter to work. The best solution I've thought of is to preprocess the input string to replace the multicharacter delimiter with a single-character delimiter that will be guaranteed not to collide with the contents of the input string. The only character that has this guarantee is the NUL byte. This is because, in bash (though not in zsh, incidentally), variables cannot contain the NUL byte. This preprocessing step can be done inline in a process substitution. Here's how to do it using awk:
readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; }' <<<"$string, "); unset 'a[-1]';
declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
There, finally! This solution will not erroneously split fields in the middle, will not cut out prematurely, will not drop empty fields, will not corrupt itself on filename expansions, will not automatically strip leading and trailing whitespace, will not leave a stowaway LF on the end, does not require loops, and does not settle for a single-character delimiter.
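The claim that bash variables cannot contain the NUL byte, which is what makes NUL a collision-free delimiter, is easy to check. In this sketch the braces and the 2>/dev/null are there only to hide the warning that newer bash versions print:

```bash
# Command substitution drops NUL bytes; bash 4.4+ also warns on stderr.
{ v=$(printf 'a\0b'); } 2>/dev/null
echo "${#v}"   # prints: 2
```

The NUL never makes it into the variable, so it can never appear in an input string that was itself stored in a bash variable.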
Trimming solution
Lastly, I wanted to demonstrate my own fairly intricate trimming solution using the obscure -C callback option of readarray. Unfortunately, I've run out of room against Stack Overflow's draconian 30,000 character post limit, so I won't be able to explain it. I'll leave that as an exercise for the reader.
function mfcb { local val="$4"; "$1"; eval "$2[$3]=\$val;"; };
function val_ltrim { if [[ "$val" =~ ^[[:space:]]+ ]]; then val="${val:${#BASH_REMATCH[0]}}"; fi; };
function val_rtrim { if [[ "$val" =~ [[:space:]]+$ ]]; then val="${val:0:${#val}-${#BASH_REMATCH[0]}}"; fi; };
function val_trim { val_ltrim; val_rtrim; };
readarray -c1 -C 'mfcb val_trim a' -td, <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
Here is a way without setting IFS:
string="1:2:3:4:5"
set -f # avoid globbing (expansion of *).
array=(${string//:/ })
for i in "${!array[@]}"
do
echo "$i=>${array[i]}"
done
The idea is using string replacement:
${string//substring/replacement}
to replace all matches of the substring with a space, and then use the resulting string to initialize an array:
(element1 element2 ... elementN)
Note: this answer makes use of the split+glob operator. Thus, to prevent expansion of some characters (such as *) it is a good idea to pause globbing for this script.
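A small sketch of why the set -f matters, and of turning globbing back on afterwards (the sample string is mine):

```bash
string="1:*:3"
set -f                       # pause globbing so '*' is not expanded to filenames
array=(${string//:/ })
set +f                       # resume globbing for the rest of the script
declare -p array
## declare -a array=([0]="1" [1]="*" [2]="3")
```

Without set -f, the middle element would be replaced by the names of the files in the current directory. Fields that themselves contain spaces would still be split, since this approach relies on word splitting.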
t="one,two,three"
a=($(echo "$t" | tr ',' '\n'))
echo "${a[2]}"
Prints three
Sometimes the method described in the accepted answer didn't work for me, especially when the separator is a newline.
In those cases I solved it this way:
string='first line
second line
third line'
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"
for line in "${lines[@]}"
do
echo "--> $line"
done
The accepted answer works for values in one line. If the variable has several lines:
string='first line
second line
third line'
We need a very different command to get all lines:
while read -r line; do lines+=("$line"); done <<<"$string"
Or the much simpler bash readarray:
readarray -t lines <<<"$string"
Printing all lines is very easy taking advantage of a printf feature:
printf ">[%s]\n" "${lines[@]}"
>[first line]
>[ second line]
>[ third line]
If you use macOS and can't use readarray, you can simply do this:
MY_STRING="string1 string2 string3"
array=($MY_STRING)
To iterate over the elements:
for element in "${array[@]}"
do
echo $element
done
This works for me on OSX:
string="1 2 3 4 5"
declare -a array=($string)
If your string has a different delimiter, first replace it with spaces:
string="1,2,3,4,5"
delimiter=","
declare -a array=($(echo $string | tr "$delimiter" " "))
Simple :-)
This is similar to the approach by Jmoney38, but using sed:
string="1,2,3,4"
array=(`echo $string | sed 's/,/\n/g'`)
echo ${array[0]}
Prints 1
The key to splitting your string into an array is the multi character delimiter of ", ". Any solution using IFS for multi character delimiters is inherently wrong since IFS is a set of those characters, not a string.
If you assign IFS=", " then the string will break on EITHER "," OR " " or any combination of them which is not an accurate representation of the two character delimiter of ", ".
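A minimal sketch of that failure mode, using a made-up input in which one field contains a space:

```bash
string='New York, Paris'
IFS=', ' read -r -a parts <<<"$string"
declare -p parts
## declare -a parts=([0]="New" [1]="York" [2]="Paris")
```

Because the space is treated as a separator in its own right, "New York" is broken into two elements even though the intended delimiter ", " never occurs inside it.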
You can use awk or sed to split the string, with process substitution:
#!/bin/bash
str="Paris, France, Europe"
array=()
while read -r -d $'\0' each; do # use a NUL terminated field separator
array+=("$each")
done < <(printf "%s" "$str" | awk '{ gsub(/,[ ]+|$/,"\0"); print }')
declare -p array
# declare -a array=([0]="Paris" [1]="France" [2]="Europe") output
It is more efficient to use a regex directly in Bash:
#!/bin/bash
str="Paris, France, Europe"
array=()
while [[ $str =~ ([^,]+)(,[ ]+|$) ]]; do
array+=("${BASH_REMATCH[1]}") # capture the field
i=${#BASH_REMATCH} # length of field + delimiter
str=${str:i} # advance the string by that length
done # the loop deletes $str, so make a copy if needed
declare -p array
# declare -a array=([0]="Paris" [1]="France" [2]="Europe") output...
With the second form, there is no sub shell and it will be inherently faster.
Edit by bgoldst: Here are some benchmarks comparing my readarray solution to dawg's regex solution, and I also included the read solution for the heck of it (note: I slightly modified the regex solution for greater harmony with my solution) (also see my comments below the post):
## competitors
function c_readarray { readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; };' <<<"$1, "); unset 'a[-1]'; };
function c_read { a=(); local REPLY=''; while read -r -d ''; do a+=("$REPLY"); done < <(awk '{ gsub(/, /,"\0"); print; };' <<<"$1, "); };
function c_regex { a=(); local s="$1, "; while [[ $s =~ ([^,]+),\ ]]; do a+=("${BASH_REMATCH[1]}"); s=${s:${#BASH_REMATCH}}; done; };
## helper functions
function rep {
local -i i=-1;
for ((i = 0; i<$1; ++i)); do
printf %s "$2";
done;
}; ## end rep()
function testAll {
local funcs=();
local args=();
local func='';
local -i rc=-1;
while [[ "$1" != ':' ]]; do
func="$1";
if [[ ! "$func" =~ ^[_a-zA-Z][_a-zA-Z0-9]*$ ]]; then
echo "bad function name: $func" >&2;
return 2;
fi;
funcs+=("$func");
shift;
done;
shift;
args=("$@");
for func in "${funcs[@]}"; do
echo -n "$func ";
{ time $func "${args[@]}" >/dev/null 2>&1; } 2>&1| tr '\n' '/';
rc=${PIPESTATUS[0]}; if [[ $rc -ne 0 ]]; then echo "[$rc]"; else echo; fi;
done| column -ts/;
}; ## end testAll()
function makeStringToSplit {
local -i n=$1; ## number of fields
if [[ $n -lt 0 ]]; then echo "bad field count: $n" >&2; return 2; fi;
if [[ $n -eq 0 ]]; then
echo;
elif [[ $n -eq 1 ]]; then
echo 'first field';
elif [[ "$n" -eq 2 ]]; then
echo 'first field, last field';
else
echo "first field, $(rep $(($1-2)) 'mid field, ')last field";
fi;
}; ## end makeStringToSplit()
function testAll_splitIntoArray {
local -i n=$1; ## number of fields in input string
local s='';
echo "===== $n field$(if [[ $n -ne 1 ]]; then echo 's'; fi;) =====";
s="$(makeStringToSplit "$n")";
testAll c_readarray c_read c_regex : "$s";
}; ## end testAll_splitIntoArray()
## results
testAll_splitIntoArray 1;
## ===== 1 field =====
## c_readarray real 0m0.067s user 0m0.000s sys 0m0.000s
## c_read real 0m0.064s user 0m0.000s sys 0m0.000s
## c_regex real 0m0.000s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 10;
## ===== 10 fields =====
## c_readarray real 0m0.067s user 0m0.000s sys 0m0.000s
## c_read real 0m0.064s user 0m0.000s sys 0m0.000s
## c_regex real 0m0.001s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 100;
## ===== 100 fields =====
## c_readarray real 0m0.069s user 0m0.000s sys 0m0.062s
## c_read real 0m0.065s user 0m0.000s sys 0m0.046s
## c_regex real 0m0.005s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 1000;
## ===== 1000 fields =====
## c_readarray real 0m0.084s user 0m0.031s sys 0m0.077s
## c_read real 0m0.092s user 0m0.031s sys 0m0.046s
## c_regex real 0m0.125s user 0m0.125s sys 0m0.000s
##
testAll_splitIntoArray 10000;
## ===== 10000 fields =====
## c_readarray real 0m0.209s user 0m0.093s sys 0m0.108s
## c_read real 0m0.333s user 0m0.234s sys 0m0.109s
## c_regex real 0m9.095s user 0m9.078s sys 0m0.000s
##
testAll_splitIntoArray 100000;
## ===== 100000 fields =====
## c_readarray real 0m1.460s user 0m0.326s sys 0m1.124s
## c_read real 0m2.780s user 0m1.686s sys 0m1.092s
## c_regex real 17m38.208s user 15m16.359s sys 2m19.375s
##
Pure bash multi-character delimiter solution.
As others have pointed out in this thread, the OP's question gave an example of a comma delimited string to be parsed into an array, but did not indicate if he/she was only interested in comma delimiters, single character delimiters, or multi-character delimiters.
Since Google tends to rank this answer at or near the top of search results, I wanted to provide readers with a strong answer to the question of multiple character delimiters, since that is also mentioned in at least one response.
If you're in search of a solution to a multi-character delimiter problem, I suggest reviewing Mallikarjun M's post, in particular the response from gniourf_gniourf
who provides this elegant pure BASH solution using parameter expansion:
#!/bin/bash
str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=();
while [[ $s ]]; do
array+=( "${s%%"$delimiter"*}" );
s=${s#*"$delimiter"};
done;
declare -p array
Link to cited comment/referenced post
Link to cited question: Howto split a string on a multi-character delimiter in bash?
Update 3 Aug 2022
xebeche raised a good point in the comments below. After reviewing their suggested edits, I've revised the script provided by gniourf_gniourf and added remarks to make it easier to see what the script is doing. I also changed the double brackets [[ ]] to single brackets for greater compatibility, since many shell variants do not support double-bracket notation. In this case, for Bash, the logic works inside single or double brackets.
#!/bin/bash
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()
while [ "$str" ]; do
# parse next sub-string, left of next delimiter
substring="${str%%"$delimiter"*}"
# when substring = delimiter, truncate leading delimiter
# (i.e. pattern is "$delimiter$delimiter")
[ -z "$substring" ] && str="${str#"$delimiter"}" && continue
# create next array element with parsed substring
array+=( "$substring" )
# remaining string to the right of delimiter becomes next string to be evaluated
str="${str:${#substring}}"
# prevent infinite loop when last substring = delimiter
[ "$str" == "$delimiter" ] && break
done
declare -p array
Without the comments:
#!/bin/bash
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()
while [ "$str" ]; do
substring="${str%%"$delimiter"*}"
[ -z "$substring" ] && str="${str#"$delimiter"}" && continue
array+=( "$substring" )
str="${str:${#substring}}"
[ "$str" == "$delimiter" ] && break
done
declare -p array
I was curious about the relative performance of the "Right answer"
in the popular answer by @bgoldst, with its apparent decrying of loops,
so I have done a simple benchmark of it against three pure bash implementations.
In summary, I suggest:
for string length < 4k or so, pure bash is faster than gawk
for delimiter length < 10 and string length < 256k, pure bash is comparable to gawk
for delimiter length >> 10 and string length < 64k or so, pure bash is "acceptable";
and gawk is less than 5x faster
for string length < 512k or so, gawk is "acceptable"
I arbitrarily define "acceptable" as "takes < 0.5s to split the string".
I am taking the problem to be to take a bash string and split it into a bash array, using an arbitrary-length delimiter string (not regex).
# in: $1=delim, $2=string
# out: sets array a
My pure bash implementations are:
# naive approach - slow
split_byStr_bash_naive(){
a=()
local prev=""
local cdr="$2"
[[ -z "${cdr}" ]] && a+=("")
while [[ "$cdr" != "$prev" ]]; do
prev="$cdr"
a+=( "${cdr%%"$1"*}" )
cdr="${cdr#*"$1"}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use lengths wherever possible - faster
split_byStr_bash_faster(){
a=()
local car=""
local cdr="$2"
while
car="${cdr%%"$1"*}"
a+=("$car")
cdr="${cdr:${#car}}"
(( ${#cdr} ))
do
cdr="${cdr:${#1}}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use pattern substitution and readarray - fastest
split_byStr_bash_sub(){
a=()
local delim="$1" string="$2"
delim="${delim//=/=-}"
delim="${delim//$'\n'/=n}"
string="${string//=/=-}"
string="${string//$'\n'/=n}"
readarray -td $'\n' a <<<"${string//"$delim"/$'\n'}"
local len=${#a[@]} i s
for (( i=0; i<len; i++ )); do
s="${a[$i]//=n/$'\n'}"
a[$i]="${s//=-/=}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
The initial -z test in the naive version handles the case of a zero-length
string being passed. Without the test, the output array is empty;
with it, the array has a single zero-length element.
Replacing readarray with while read gives < 10% slowdown.
This is the gawk implementation I used:
split_byRE_gawk(){
readarray -td '' a < <(awk '{gsub(/'"$1"'/,"\0")}1' <<<"$2$1")
unset 'a[-1]'
# echo $( declare -p a | md5sum; declare -p a )
}
Obviously, in the general case, the delim argument will need to be sanitised,
as gawk expects a regex, and gawk-special characters could cause problems.
Also, as-is, the implementation won't correctly handle newlines in the delimiter.
Since gawk is being used, a generalised version that handles more arbitrary
delimiters could be:
split_byREorStr_gawk(){
local delim=$1
local string=$2
local useRegex=${3:+1} # if set, delimiter is regex
readarray -td '' a < <(
export delim
gawk -v re="$useRegex" '
BEGIN {
RS = FS = "\0"
ORS = ""
d = ENVIRON["delim"]
# cf. https://stackoverflow.com/a/37039138
if (!re) gsub(/[\\.^$(){}\[\]|*+?]/,"\\\\&",d)
}
gsub(d"|\n$","\0")
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
or the same idea in Perl:
split_byREorStr_perl(){
local delim=$1
local string=$2
local regex=$3 # if set, delimiter is regex
readarray -td '' a < <(
export delim regex
perl -0777pe '
$d = $ENV{delim};
$d = "\Q$d\E" if ! $ENV{regex};
s/$d|\n$/\0/g;
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
The implementations produce identical output, tested by comparing md5sum separately.
Note that if input had been ambiguous ("logically incorrect" as @bgoldst puts it),
behaviour would diverge slightly. For example, with delimiter -- and string a- or a---:
@bgoldst's code returns: declare -a a=([0]="a") or declare -a a=([0]="a" [1]="")
mine return: declare -a a=([0]="a-") or declare -a a=([0]="a" [1]="-")
Arguments were derived with simple Perl scripts from:
delim="-=-="
base="ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"
Here are the tables of timing results (in seconds) for 3 different types
of string and delimiter argument.
#s - length of string argument
#d - length of delim argument
= - performance break-even point
! - "acceptable" performance limit (bash) is somewhere around here
!! - "acceptable" performance limit (gawk) is somewhere around here
- - function took too long
<!> - gawk command failed to run
Type 1
d=$(perl -e "print( '$delim' x (7*2**$n) )")
s=$(perl -e "print( '$delim' x (7*2**$n) . '$base' x (7*2**$n) )")
n    #s        #d       gawk     b_sub    b_faster  b_naive
0    252       28       0.002    0.000    0.000     0.000
1    504       56       0.005    0.000    0.000     0.001
2    1008      112      0.005    0.001    0.000     0.003
3    2016      224      0.006    0.001    0.000     0.009
4    4032      448      0.007    0.002    0.001     0.048
=
5    8064      896      0.014    0.008    0.005     0.377
6    16128     1792     0.018    0.029    0.017     (2.214)
7    32256     3584     0.033    0.057    0.039     (15.16)
!
8    64512     7168     0.063    0.214    0.128     -
9    129024    14336    0.111    (0.826)  (0.602)   -
10   258048    28672    0.214    (3.383)  (2.652)   -
!!
11   516096    57344    0.430    (13.46)  (11.00)   -
12   1032192   114688   (0.834)  (58.38)  -         -
13   2064384   229376   <!>      (228.9)  -         -
Type 2
d=$(perl -e "print( '$delim' x ($n) )")
s=$(perl -e "print( ('$delim' x ($n) . '$base' x $n ) x (2**($n-1)) )")
n    #s        #d       gawk     b_sub    b_faster  b_naive
0    0         0        0.003    0.000    0.000     0.000
1    36        4        0.003    0.000    0.000     0.000
2    144       8        0.005    0.000    0.000     0.000
3    432       12       0.005    0.000    0.000     0.000
4    1152      16       0.005    0.001    0.001     0.002
5    2880      20       0.005    0.001    0.002     0.003
6    6912      24       0.006    0.003    0.009     0.014
=
7    16128     28       0.012    0.012    0.037     0.044
8    36864     32       0.023    0.044    0.167     0.187
!
9    82944     36       0.049    0.192    (0.753)   (0.840)
10   184320    40       0.097    (0.925)  (3.682)   (4.016)
11   405504    44       0.204    (4.709)  (18.00)   (19.58)
!!
12   884736    48       0.444    (22.17)  -         -
13   1916928   52       (1.019)  (102.4)  -         -
Type 3
d=$(perl -e "print( '$delim' x (2**($n-1)) )")
s=$(perl -e "print( ('$delim' x (2**($n-1)) . '$base' x (2**($n-1)) ) x ($n) )")
n    #s        #d       gawk     b_sub    b_faster  b_naive
0    0         0        0.000    0.000    0.000     0.000
1    36        4        0.004    0.000    0.000     0.000
2    144       8        0.003    0.000    0.000     0.000
3    432       16       0.003    0.000    0.000     0.000
4    1152      32       0.005    0.001    0.001     0.002
5    2880      64       0.005    0.002    0.001     0.003
6    6912      128      0.006    0.003    0.003     0.014
=
7    16128     256      0.012    0.011    0.010     0.077
8    36864     512      0.023    0.046    0.046     (0.513)
!
9    82944     1024     0.049    0.195    0.197     (3.850)
10   184320    2048     0.103    (0.951)  (1.061)   (31.84)
11   405504    4096     0.222    (4.796)  -         -
!!
12   884736    8192     0.473    (22.88)  -         -
13   1916928   16384    (1.126)  (105.4)  -         -
Summary of delimiters length 1..10
As short delimiters are probably more likely than long,
summarised below are the results of varying delimiter length
between 1 and 10 (results for 2..9 mostly elided as very similar).
s1=$(perl -e "print( '$d' . '$base' x (7*2**$n) )")
s2=$(perl -e "print( ('$d' . '$base' x $n ) x (2**($n-1)) )")
s3=$(perl -e "print( ('$d' . '$base' x (2**($n-1)) ) x ($n) )")
bash_sub < gawk
string  n    #s       #d   gawk    b_sub   b_faster  b_naive
s1      10   229377   1    0.131   0.089   1.709     -
s1      10   229386   10   0.142   0.095   1.907     -
s2      8    32896    1    0.022   0.007   0.148     0.168
s2      8    34048    10   0.021   0.021   0.163     0.179
s3      12   786444   1    0.436   0.468   -         -
s3      12   786456   2    0.434   0.317   -         -
s3      12   786552   10   0.438   0.333   -         -
bash_sub < 0.5s
string  n    #s       #d   gawk    b_sub   b_faster  b_naive
s1      11   458753   1    0.256   0.332   (7.089)   -
s1      11   458762   10   0.269   0.387   (8.003)   -
s2      11   361472   1    0.205   0.283   (14.54)   -
s2      11   363520   3    0.207   0.462   (16.66)   -
s3      12   786444   1    0.436   0.468   -         -
s3      12   786456   2    0.434   0.317   -         -
s3      12   786552   10   0.438   0.333   -         -
gawk < 0.5s
string  n    #s       #d   gawk    b_sub    b_faster  b_naive
s1      11   458753   1    0.256   0.332    (7.089)   -
s1      11   458762   10   0.269   0.387    (8.003)   -
s2      12   788480   1    0.440   (1.252)  -         -
s2      12   806912   10   0.449   (4.968)  -         -
s3      12   786444   1    0.436   0.468    -         -
s3      12   786456   2    0.434   0.317    -         -
s3      12   786552   10   0.438   0.333    -         -
(I'm not entirely sure why bash_sub with s>160k and d=1 was consistently slower than d>1 for s3.)
All tests carried out with bash 5.0.17 on an Intel i7-7500U running xubuntu 20.04.
Try this
IFS=', '; array=(Paris, France, Europe)
for item in ${array[@]}; do echo $item; done
It's simple. If you want, you can also add a declare (and also remove the commas):
IFS=' ';declare -a array=(Paris France Europe)
The IFS is added to undo the above but it works without it in a fresh bash instance
#!/bin/bash
string="a | b c"
pattern=' | '
# replaces pattern with newlines
splitted="$(sed "s/$pattern/\n/g" <<< "$string")"
# Reads lines and put them in array
readarray -t array2 <<< "$splitted"
# Prints number of elements
echo ${#array2[@]}
# Prints all elements
for a in "${array2[@]}"; do
echo "> '$a'"
done
This solution works for larger delimiters (more than one char).
Doesn't work if you have a newline already in the original string
This works for the given data:
$ aaa='Paris, France, Europe'
$ mapfile -td ',' aaaa < <(echo -n "${aaa//, /,}")
$ declare -p aaaa
Result:
declare -a aaaa=([0]="Paris" [1]="France" [2]="Europe")
And it will also work for extended data with spaces, such as "New York":
$ aaa="New York, Paris, New Jersey, Hampshire"
$ mapfile -td ',' aaaa < <(echo -n "${aaa//, /,}")
$ declare -p aaaa
Result:
declare -a aaaa=([0]="New York" [1]="Paris" [2]="New Jersey" [3]="Hampshire")
Another way to do it without modifying IFS:
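read -r -d '' -a myarray <<< "${string//, /$IFS}"
Rather than changing IFS to match our desired delimiter, we can replace all occurrences of our desired delimiter ", " with the contents of $IFS via "${string//, /$IFS}". Because the default $IFS contains a newline, read needs -d '' so that it consumes the whole input instead of stopping at the first line; it then returns a non-zero status at end of input, which is harmless here.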
Maybe this will be slow for very large strings though?
This is based on Dennis Williamson's answer.
I came across this post when looking to parse an input like:
word1,word2,...
None of the above helped me. I solved it by using awk. If it helps someone:
STRING="value1,value2,value3"
array=`echo $STRING | awk -F ',' '{ s = $1; for (i = 2; i <= NF; i++) s = s "\n"$i; print s; }'`
for word in ${array}
do
echo "This is the word $word"
done
UPDATE: Don't do this, due to problems with eval.
With slightly less ceremony:
IFS=', ' eval 'array=($string)'
e.g.
string="foo, bar,baz"
IFS=', ' eval 'array=($string)'
echo ${array[1]} # -> bar
Do not change IFS!
Here's a simple bash one-liner:
read -a my_array <<< $(echo ${INPUT_STRING} | tr -d ' ' | tr ',' ' ')
Here's my hack!
Splitting strings by strings is a pretty boring thing to do using bash. What happens is that we have limited approaches that only work in a few cases (split by ";", "/", "." and so on) or we have a variety of side effects in the outputs.
The approach below has required a number of maneuvers, but I believe it will work for most of our needs!
#!/bin/bash
# --------------------------------------
# SPLIT FUNCTION
# ----------------
F_SPLIT_R=()
f_split() {
: 'It does a "split" into a given string and returns an array.
Args:
TARGET_P (str): Target string to "split".
DELIMITER_P (Optional[str]): Delimiter used to "split". If not
informed the split will be done by spaces.
Returns:
F_SPLIT_R (array): Array with the provided string separated by the
informed delimiter.
'
F_SPLIT_R=()
TARGET_P=$1
DELIMITER_P=$2
if [ -z "$DELIMITER_P" ] ; then
DELIMITER_P=" "
fi
REMOVE_N=1
if [ "$DELIMITER_P" == "\n" ] ; then
REMOVE_N=0
fi
# NOTE: This was the only parameter that has been a problem so far!
# By Questor
# [Ref.: https://unix.stackexchange.com/a/390732/61742]
if [ "$DELIMITER_P" == "./" ] ; then
DELIMITER_P="[.]/"
fi
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: Due to bash limitations we have some problems getting the
# output of a split by awk inside an array and so we need to use
# "line break" (\n) to succeed. Seen this, we remove the line breaks
# momentarily afterwards we reintegrate them. The problem is that if
# there is a line break in the "string" informed, this line break will
# be lost, that is, it is erroneously removed in the output!
# By Questor
TARGET_P=$(awk 'BEGIN {RS="dn"} {gsub("\n", "3F2C417D448C46918289218B7337FCAF"); printf $0}' <<< "${TARGET_P}")
fi
# NOTE: The replace of "\n" by "3F2C417D448C46918289218B7337FCAF" results
# in more occurrences of "3F2C417D448C46918289218B7337FCAF" than the
# amount of "\n" that there was originally in the string (one more
# occurrence at the end of the string)! We can not explain the reason for
# this side effect. The line below corrects this problem! By Questor
TARGET_P=${TARGET_P%????????????????????????????????}
SPLIT_NOW=$(awk -F"$DELIMITER_P" '{for(i=1; i<=NF; i++){printf "%s\n", $i}}' <<< "${TARGET_P}")
while IFS= read -r LINE_NOW ; do
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: We use "'" to prevent blank lines with no other characters
# in the sequence being erroneously removed! We do not know the
# reason for this side effect! By Questor
LN_NOW_WITH_N=$(awk 'BEGIN {RS="dn"} {gsub("3F2C417D448C46918289218B7337FCAF", "\n"); printf $0}' <<< "'${LINE_NOW}'")
# NOTE: We use the commands below to revert the intervention made
# immediately above! By Questor
LN_NOW_WITH_N=${LN_NOW_WITH_N%?}
LN_NOW_WITH_N=${LN_NOW_WITH_N#?}
F_SPLIT_R+=("$LN_NOW_WITH_N")
else
F_SPLIT_R+=("$LINE_NOW")
fi
done <<< "$SPLIT_NOW"
}
# --------------------------------------
# HOW TO USE
# ----------------
STRING_TO_SPLIT="
* How do I list all databases and tables using psql?
\"
sudo -u postgres /usr/pgsql-9.4/bin/psql -c \"\l\"
sudo -u postgres /usr/pgsql-9.4/bin/psql <DB_NAME> -c \"\dt\"
\"
\"
\list or \l: list all databases
\dt: list all tables in the current database
\"
[Ref.: https://dba.stackexchange.com/questions/1285/how-do-i-list-all-databases-and-tables-using-psql]
"
f_split "$STRING_TO_SPLIT" "bin/psql -c"
# --------------------------------------
# OUTPUT AND TEST
# ----------------
ARR_LENGTH=${#F_SPLIT_R[*]}
for (( i=0; i<=$(( $ARR_LENGTH -1 )); i++ )) ; do
echo " > -----------------------------------------"
echo "${F_SPLIT_R[$i]}"
echo " < -----------------------------------------"
done
if [ "$STRING_TO_SPLIT" == "${F_SPLIT_R[0]}bin/psql -c${F_SPLIT_R[1]}" ] ; then
echo " > -----------------------------------------"
echo "The strings are the same!"
echo " < -----------------------------------------"
fi
For multilined elements, why not something like
$ array=($(echo -e $'a a\nb b' | tr ' ' '§')) && array=("${array[@]//§/ }") && echo "${array[@]/%/ INTERELEMENT}"
a a INTERELEMENT b b INTERELEMENT
Since there are so many ways to solve this, let's start by defining what we want to see in our solution.
Bash provides a builtin readarray for this purpose. Let's use it.
Avoid ugly and unnecessary tricks such as changing IFS, looping, using eval, or adding an extra element then removing it.
Find a simple, readable approach that can easily be adapted to similar problems.
The readarray command is easiest to use with newlines as the delimiter. With other delimiters it may add an extra element to the array. The cleanest approach is to first adapt our input into a form that works nicely with readarray before passing it in.
The input in this example does not have a multi-character delimiter. If we apply a little common sense, it's best understood as comma separated input for which each element may need to be trimmed. My solution is to split the input by comma into multiple lines, trim each element, and pass it all to readarray.
string=' Paris,France , All of Europe '
readarray -t foo < <(tr ',' '\n' <<< "$string" |sed 's/^ *//' |sed 's/ *$//')
# Result:
declare -p foo
# declare -a foo='([0]="Paris" [1]="France" [2]="All of Europe")'
EDIT: My solution allows inconsistent spacing around comma separators, while also allowing elements to contain spaces. Few other solutions can handle these special cases.
I also avoid approaches which seem like hacks, such as creating an extra array element and then removing it. If you don't agree it's the best answer here, please leave a comment to explain.
If you'd like to try the same approach purely in Bash and with fewer subshells, it's possible. But the result is harder to read, and this optimization is probably unnecessary.
shopt -s extglob # required for the +(...) patterns below
string=' Paris,France , All of Europe '
foo="${string#"${string%%[![:space:]]*}"}"
foo="${foo%"${foo##*[![:space:]]}"}"
foo="${foo//+([[:space:]]),/,}"
foo="${foo//,+([[:space:]])/,}"
readarray -t foo <<< "${foo//,/$'\n'}" # one element per line
Another way would be:
string="Paris, France, Europe"
IFS=', ' arr=(${string})
Now your elements are stored in "arr" array.
To iterate through the elements:
for i in "${arr[@]}"; do echo "$i"; done
Another approach can be:
str="a, b, c, d" # assuming there is a space after ',' as in Q
arr=(${str//,/}) # delete all occurrences of ','
After this 'arr' is an array with four strings.
This doesn't require dealing with IFS, read, or any other special machinery, so it is simpler and more direct.

Shell script array from command line

I'm trying to write a shell script that can accept multiple elements on the command line to be treated as a single array. The command line argument format is:
exec trial.sh 1 2 {element1 element2} 4
I know that the first two arguments can be accessed with $1 and $2, but how can I access the array, that is, the arguments surrounded by the {} symbols?
Thanks!
This tcl script uses regex parsing to extract pieces of the commandline, transforming your third argument into a list.
Splitting is done on whitespace; depending on where you want to use this, that may or may not be sufficient.
#!/usr/bin/env tclsh
#
# Sample arguments: 1 2 {element1 element2} 4
# Split the commandline arguments:
# - tcl will represent the curly brackets as \{ which makes the regex a bit ugly as we have to escape this
# - we use '->' to catch the full regex match as we are not interested in the value and it looks good
# - we are splitting on white spaces here
# - the content between the curly braces is extracted
regexp {(.*?)\s(.*?)\s\\\{(.*?)\\\}\s(.*?)$} $::argv -> first second third fourth
puts "Argument extraction:"
puts "argv: $::argv"
puts "arg1: $first"
puts "arg2: $second"
puts "arg3: $third"
puts "arg4: $fourth"
# Third argument is to be treated as an array, again split on white space
set theArguments [regexp -all -inline {\S+} $third]
puts "\nArguments for parameter 3"
foreach arg $theArguments {
puts "arg: $arg"
}
You should always place variable-length arguments at the end. But if you can guarantee that the last argument is always provided, then something like this will suffice:
#!/bin/bash
arg1=$1 ; shift
arg2=$1 ; shift
# Get the array passed in.
arrArgs=()
while (( $# > 1 )) ; do
arrArgs=( "${arrArgs[@]}" "$1" )
shift
done
lastArg=$1 ; shift
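To try the loop above without a separate script file, here is a minimal sketch that wraps the same shift-based logic in a function. The name parse_args is made up for illustration, and the braces from the question are left out, since Bash would pass them through as part of the literal words {element1 and element2}.

```bash
#!/usr/bin/env bash
# parse_args is a hypothetical wrapper around the same shift-based logic.
parse_args() {
  arg1=$1; shift
  arg2=$1; shift
  arrArgs=()
  # Everything except the final argument belongs to the array.
  while (( $# > 1 )); do
    arrArgs+=("$1")
    shift
  done
  lastArg=$1
}

parse_args 1 2 element1 element2 4
declare -p arg1 arg2 arrArgs lastArg
```

The array can hold any number of elements, provided the caller always supplies the trailing argument.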

How to split a string into an array in Bash?

In a Bash script, I would like to split a line into pieces and store them in an array.
For example, given the line:
Paris, France, Europe
I would like to have the resulting array to look like so:
array[0] = Paris
array[1] = France
array[2] = Europe
A simple implementation is preferable; speed does not matter. How can I do it?
IFS=', ' read -r -a array <<< "$string"
Note that the characters in $IFS are treated individually as separators so that in this case fields may be separated by either a comma or a space rather than the sequence of the two characters. Interestingly though, empty fields aren't created when comma-space appears in the input because the space is treated specially.
To access an individual element:
echo "${array[0]}"
To iterate over the elements:
for element in "${array[@]}"
do
echo "$element"
done
To get both the index and the value:
for index in "${!array[@]}"
do
echo "$index ${array[index]}"
done
The last example is useful because Bash arrays are sparse. In other words, you can delete an element or add an element and then the indices are not contiguous.
unset "array[1]"
array[42]=Earth
To get the number of elements in an array:
echo "${#array[@]}"
As mentioned above, arrays can be sparse so you shouldn't use the length to get the last element. Here's how you can in Bash 4.2 and later:
echo "${array[-1]}"
in any version of Bash (from somewhere after 2.05b):
echo "${array[@]: -1:1}"
Larger negative offsets select farther from the end of the array. Note the space before the minus sign in the older form. It is required.
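The sparse-array points above can be checked with a short sketch. The variable names count, indices and last are only illustrative.

```bash
#!/usr/bin/env bash
array=(Paris France Europe)
unset 'array[1]'                 # leaves indices 0 and 2
array[42]=Earth                  # indices are now 0, 2 and 42

count=${#array[@]}               # number of elements, not the highest index
indices=("${!array[@]}")
last="${array[@]: -1:1}"         # last element, regardless of its index

echo "count=$count indices=${indices[*]} last=$last"
echo "array[count-1] is '${array[count-1]}'"
```

The first line printed is count=3 indices=0 2 42 last=Earth, while array[count-1] is Europe, which is why the length should not be used to find the last element.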
All of the answers to this question are wrong in one way or another.
Wrong answer #1
IFS=', ' read -r -a array <<< "$string"
1: This is a misuse of $IFS. The value of the $IFS variable is not taken as a single variable-length string separator, rather it is taken as a set of single-character string separators, where each field that read splits off from the input line can be terminated by any character in the set (comma or space, in this example).
Actually, for the real sticklers out there, the full meaning of $IFS is slightly more involved. From the bash manual:
The shell treats each character of IFS as a delimiter, and splits the results of the other expansions into words using these characters as field terminators. If IFS is unset, or its value is exactly <space><tab><newline>, the default, then sequences of <space>, <tab>, and <newline> at the beginning and end of the results of the previous expansions are ignored, and any sequence of IFS characters not at the beginning or end serves to delimit words. If IFS has a value other than the default, then sequences of the whitespace characters <space>, <tab>, and <newline> are ignored at the beginning and end of the word, as long as the whitespace character is in the value of IFS (an IFS whitespace character). Any character in IFS that is not IFS whitespace, along with any adjacent IFS whitespace characters, delimits a field. A sequence of IFS whitespace characters is also treated as a delimiter. If the value of IFS is null, no word splitting occurs.
Basically, for non-default non-null values of $IFS, fields can be separated with either (1) a sequence of one or more characters that are all from the set of "IFS whitespace characters" (that is, whichever of <space>, <tab>, and <newline> ("newline" meaning line feed (LF)) are present anywhere in $IFS), or (2) any non-"IFS whitespace character" that's present in $IFS along with whatever "IFS whitespace characters" surround it in the input line.
For the OP, it's possible that the second separation mode I described in the previous paragraph is exactly what he wants for his input string, but we can be pretty confident that the first separation mode I described is not correct at all. For example, what if his input string was 'Los Angeles, United States, North America'?
IFS=', ' read -ra a <<<'Los Angeles, United States, North America'; declare -p a;
## declare -a a=([0]="Los" [1]="Angeles" [2]="United" [3]="States" [4]="North" [5]="America")
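The two separation modes described above can be seen side by side in a small sketch. The array names ws, mixed and commas are arbitrary.

```bash
#!/usr/bin/env bash
# Mode 1: a run of IFS whitespace acts as a single separator.
IFS=', ' read -ra ws <<< 'a   b'
# Mode 2: a non-whitespace IFS character plus adjacent IFS whitespace is one separator.
IFS=', ' read -ra mixed <<< 'a , b'
# Two commas are two separators, so an empty field appears between them.
IFS=', ' read -ra commas <<< 'a,,b'
declare -p ws mixed commas
```

The first two arrays each hold a and b; the third holds a, an empty string, and b.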
2: Even if you were to use this solution with a single-character separator (such as a comma by itself, that is, with no following space or other baggage), if the value of the $string variable happens to contain any LFs, then read will stop processing once it encounters the first LF. The read builtin only processes one line per invocation. This is true even if you are piping or redirecting input only to the read statement, as we are doing in this example with the here-string mechanism, and thus unprocessed input is guaranteed to be lost. The code that powers the read builtin has no knowledge of the data flow within its containing command structure.
You could argue that this is unlikely to cause a problem, but still, it's a subtle hazard that should be avoided if possible. It is caused by the fact that the read builtin actually does two levels of input splitting: first into lines, then into fields. Since the OP only wants one level of splitting, this usage of the read builtin is not appropriate, and we should avoid it.
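A minimal sketch of the hazard, assuming a comma-only separator and an input value that happens to contain an LF:

```bash
#!/usr/bin/env bash
string=$'a,b\nc,d'
# read consumes only the first line; "c,d" is silently discarded.
IFS=, read -ra fields <<< "$string"
declare -p fields
```

Only a and b end up in the array.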
3: A non-obvious potential issue with this solution is that read always drops the trailing field if it is empty, although it preserves empty fields otherwise. Here's a demo:
string=', , a, , b, c, , , '; IFS=', ' read -ra a <<<"$string"; declare -p a;
## declare -a a=([0]="" [1]="" [2]="a" [3]="" [4]="b" [5]="c" [6]="" [7]="")
Maybe the OP wouldn't care about this, but it's still a limitation worth knowing about. It reduces the robustness and generality of the solution.
This problem can be solved by appending a dummy trailing delimiter to the input string just prior to feeding it to read, as I will demonstrate later.
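As a quick sketch of that workaround, assuming a comma-only separator (the array names plain and padded are arbitrary):

```bash
#!/usr/bin/env bash
string='a,b,'
IFS=, read -ra plain  <<< "$string"    # the empty field after the last comma is dropped
IFS=, read -ra padded <<< "$string,"   # the extra delimiter makes read keep it
declare -p plain padded
```

Here plain has two elements, and padded has three, the last of which is empty.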
Wrong answer #2
string="1:2:3:4:5"
set -f # avoid globbing (expansion of *).
array=(${string//:/ })
Similar idea:
t="one,two,three"
a=($(echo $t | tr ',' "\n"))
(Note: I added the missing parentheses around the command substitution which the answerer seems to have omitted.)
Similar idea:
string="1,2,3,4"
array=(`echo $string | sed 's/,/\n/g'`)
These solutions leverage word splitting in an array assignment to split the string into fields. Funnily enough, just like read, general word splitting also uses the $IFS special variable, although in this case it is implied that it is set to its default value of <space><tab><newline>, and therefore any sequence of one or more IFS characters (which are all whitespace characters now) is considered to be a field delimiter.
This solves the problem of two levels of splitting committed by read, since word splitting by itself constitutes only one level of splitting. But just as before, the problem here is that the individual fields in the input string can already contain $IFS characters, and thus they would be improperly split during the word splitting operation. This happens to not be the case for any of the sample input strings provided by these answerers (how convenient...), but of course that doesn't change the fact that any code base that used this idiom would then run the risk of blowing up if this assumption were ever violated at some point down the line. Once again, consider my counterexample of 'Los Angeles, United States, North America' (or 'Los Angeles:United States:North America').
Also, word splitting is normally followed by filename expansion (aka pathname expansion aka globbing), which, if done, would potentially corrupt words containing the characters *, ?, or [ followed by ] (and, if extglob is set, parenthesized fragments preceded by ?, *, +, @, or !) by matching them against file system objects and expanding the words ("globs") accordingly. The first of these three answerers has cleverly undercut this problem by running set -f beforehand to disable globbing. Technically this works (although you should probably add set +f afterward to reenable globbing for subsequent code which may depend on it), but it's undesirable to have to mess with global shell settings in order to hack a basic string-to-array parsing operation in local code.
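The effect of globbing is easy to reproduce. This sketch works in a throwaway directory created with mktemp so that the result does not depend on where it is run; the file names file1 and file2 are arbitrary.

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d) && cd "$tmp" || exit 1
touch file1 file2

string='a:*:b'
globbed=(${string//:/ })     # the * is expanded against the directory contents
set -f
literal=(${string//:/ })     # globbing disabled: the * stays a literal field
set +f
declare -p globbed literal

cd / && rm -rf "$tmp"
```

The first array comes out as a, file1, file2, b; the second as a, *, b.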
Another issue with this answer is that all empty fields will be lost. This may or may not be a problem, depending on the application.
Note: If you're going to use this solution, it's better to use the ${string//:/ } "pattern substitution" form of parameter expansion, rather than going to the trouble of invoking a command substitution (which forks the shell), starting up a pipeline, and running an external executable (tr or sed), since parameter expansion is purely a shell-internal operation. (Also, for the tr and sed solutions, the input variable should be double-quoted inside the command substitution; otherwise word splitting would take effect in the echo command and potentially mess with the field values. Also, the $(...) form of command substitution is preferable to the old `...` form since it simplifies nesting of command substitutions and allows for better syntax highlighting by text editors.)
Wrong answer #3
str="a, b, c, d" # assuming there is a space after ',' as in Q
arr=(${str//,/}) # delete all occurrences of ','
This answer is almost the same as #2. The difference is that the answerer has made the assumption that the fields are delimited by two characters, one of which being represented in the default $IFS, and the other not. He has solved this rather specific case by removing the non-IFS-represented character using a pattern substitution expansion and then using word splitting to split the fields on the surviving IFS-represented delimiter character.
This is not a very generic solution. Furthermore, it can be argued that the comma is really the "primary" delimiter character here, and that stripping it and then depending on the space character for field splitting is simply wrong. Once again, consider my counterexample: 'Los Angeles, United States, North America'.
Also, again, filename expansion could corrupt the expanded words, but this can be prevented by temporarily disabling globbing for the assignment with set -f and then set +f.
Also, again, all empty fields will be lost, which may or may not be a problem depending on the application.
Wrong answer #4
string='first line
second line
third line'
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"
This is similar to #2 and #3 in that it uses word splitting to get the job done, only now the code explicitly sets $IFS to contain only the single-character field delimiter present in the input string. It should be repeated that this cannot work for multicharacter field delimiters such as the OP's comma-space delimiter. But for a single-character delimiter like the LF used in this example, it actually comes close to being perfect. The fields cannot be unintentionally split in the middle as we saw with previous wrong answers, and there is only one level of splitting, as required.
One problem is that filename expansion will corrupt affected words as described earlier, although once again this can be solved by wrapping the critical statement in set -f and set +f.
Another potential problem is that, since LF qualifies as an "IFS whitespace character" as defined earlier, all empty fields will be lost, just as in #2 and #3. This would of course not be a problem if the delimiter happens to be a non-"IFS whitespace character", and depending on the application it may not matter anyway, but it does vitiate the generality of the solution.
So, to sum up, assuming you have a one-character delimiter, and it is either a non-"IFS whitespace character" or you don't care about empty fields, and you wrap the critical statement in set -f and set +f, then this solution works, but otherwise not.
(Also, for information's sake, assigning a LF to a variable in bash can be done more easily with the $'...' syntax, e.g. IFS=$'\n';.)
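Putting those conditions together, a minimal sketch of the LF case might look like this. The input deliberately contains a * line and an empty line to show what survives.

```bash
#!/usr/bin/env bash
string=$'first line\n*\n\nthird line'
oldIFS=$IFS
IFS=$'\n'
set -f                # keep the * from being expanded as a glob
lines=($string)
set +f
IFS=$oldIFS
declare -p lines
```

The * is kept literally, but the empty line is lost because LF is an IFS whitespace character, so the array has three elements rather than four. Note that the unconditional set +f assumes globbing was enabled beforehand.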
Wrong answer #5
countries='Paris, France, Europe'
OIFS="$IFS"
IFS=', ' array=($countries)
IFS="$OIFS"
Similar idea:
IFS=', ' eval 'array=($string)'
This solution is effectively a cross between #1 (in that it sets $IFS to comma-space) and #2-4 (in that it uses word splitting to split the string into fields). Because of this, it suffers from most of the problems that afflict all of the above wrong answers, sort of like the worst of all worlds.
Also, regarding the second variant, it may seem like the eval call is completely unnecessary, since its argument is a single-quoted string literal, and therefore is statically known. But there's actually a very non-obvious benefit to using eval in this way. Normally, when you run a simple command which consists of a variable assignment only, meaning without an actual command word following it, the assignment takes effect in the shell environment:
IFS=', '; ## changes $IFS in the shell environment
This is true even if the simple command involves multiple variable assignments; again, as long as there's no command word, all variable assignments affect the shell environment:
IFS=', ' array=($countries); ## changes both $IFS and $array in the shell environment
But, if the variable assignment is attached to a command name (I like to call this a "prefix assignment") then it does not affect the shell environment, and instead only affects the environment of the executed command, regardless whether it is a builtin or external:
IFS=', ' :; ## : is a builtin command, the $IFS assignment does not outlive it
IFS=', ' env; ## env is an external command, the $IFS assignment does not outlive it
Relevant quote from the bash manual:
If no command name results, the variable assignments affect the current shell environment. Otherwise, the variables are added to the environment of the executed command and do not affect the current shell environment.
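The difference can be observed directly. This sketch uses the regular builtin true as the command word and assumes bash is not running in POSIX mode, where assignments in front of special builtins such as : persist.

```bash
#!/usr/bin/env bash
before=$IFS
IFS=', ' true                 # prefix assignment: applies only to `true`
[[ $IFS == "$before" ]] && echo "IFS unchanged after prefix assignment"

IFS=', '                      # no command word: the shell's own IFS changes
[[ $IFS == ', ' ]] && echo "IFS changed by bare assignment"
after_bare=$IFS
IFS=$before                   # put things back
```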
It is possible to exploit this feature of variable assignment to change $IFS only temporarily, which allows us to avoid the whole save-and-restore gambit like that which is being done with the $OIFS variable in the first variant. But the challenge we face here is that the command we need to run is itself a mere variable assignment, and hence it would not involve a command word to make the $IFS assignment temporary. You might think to yourself, well why not just add a no-op command word to the statement like the : builtin to make the $IFS assignment temporary? This does not work because it would then make the $array assignment temporary as well:
IFS=', ' array=($countries) :; ## fails; new $array value never escapes the : command
So, we're effectively at an impasse, a bit of a catch-22. But, when eval runs its code, it runs it in the shell environment, as if it was normal, static source code, and therefore we can run the $array assignment inside the eval argument to have it take effect in the shell environment, while the $IFS prefix assignment that is prefixed to the eval command will not outlive the eval command. This is exactly the trick that is being used in the second variant of this solution:
IFS=', ' eval 'array=($string)'; ## $IFS does not outlive the eval command, but $array does
So, as you can see, it's actually quite a clever trick, and accomplishes exactly what is required (at least with respect to assignment effectation) in a rather non-obvious way. I'm actually not against this trick in general, despite the involvement of eval; just be careful to single-quote the argument string to guard against security threats.
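Here is a small sketch that verifies both halves of the claim, again assuming bash is not in POSIX mode (eval is a special builtin, so in POSIX mode the prefix assignment would persist):

```bash
#!/usr/bin/env bash
string='Paris, France, Europe'
before=$IFS
IFS=', ' eval 'array=($string)'    # single quotes: eval sees the literal text
declare -p array
[[ $IFS == "$before" ]] && echo "IFS is back to its previous value"
```

The array holds Paris, France and Europe, and $IFS is untouched afterwards.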
But again, because of the "worst of all worlds" agglomeration of problems, this is still a wrong answer to the OP's requirement.
Wrong answer #6
IFS=', '; array=(Paris, France, Europe)
IFS=' ';declare -a array=(Paris France Europe)
Um... what? The OP has a string variable that needs to be parsed into an array. This "answer" starts with the verbatim contents of the input string pasted into an array literal. I guess that's one way to do it.
It looks like the answerer may have assumed that the $IFS variable affects all bash parsing in all contexts, which is not true. From the bash manual:
IFS The Internal Field Separator that is used for word splitting after expansion and to split lines into words with the read builtin command. The default value is <space><tab><newline>.
So the $IFS special variable is actually only used in two contexts: (1) word splitting that is performed after expansion (meaning not when parsing bash source code) and (2) for splitting input lines into words by the read builtin.
Let me try to make this clearer. I think it might be good to draw a distinction between parsing and execution. Bash must first parse the source code, which obviously is a parsing event, and then later it executes the code, which is when expansion comes into the picture. Expansion is really an execution event. Furthermore, I take issue with the description of the $IFS variable that I just quoted above; rather than saying that word splitting is performed after expansion, I would say that word splitting is performed during expansion, or, perhaps even more precisely, word splitting is part of the expansion process. The phrase "word splitting" refers only to this step of expansion; it should never be used to refer to the parsing of bash source code, although unfortunately the docs do seem to throw around the words "split" and "words" a lot. Here's a relevant excerpt from the linux.die.net version of the bash manual:
Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion.
The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.
You could argue the GNU version of the manual does slightly better, since it opts for the word "tokens" instead of "words" in the first sentence of the Expansion section:
Expansion is performed on the command line after it has been split into tokens.
The important point is, $IFS does not change the way bash parses source code. Parsing of bash source code is actually a very complex process that involves recognition of the various elements of shell grammar, such as command sequences, command lists, pipelines, parameter expansions, arithmetic substitutions, and command substitutions. For the most part, the bash parsing process cannot be altered by user-level actions like variable assignments (actually, there are some minor exceptions to this rule; for example, see the various compatxx shell settings, which can change certain aspects of parsing behavior on-the-fly). The upstream "words"/"tokens" that result from this complex parsing process are then expanded according to the general process of "expansion" as broken down in the above documentation excerpts, where word splitting of the expanded (expanding?) text into downstream words is simply one step of that process. Word splitting only touches text that has been spit out of a preceding expansion step; it does not affect literal text that was parsed right off the source bytestream.
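To make the distinction concrete, this sketch sets $IFS and then builds one array from literal source text and another from an expansion. The names literal and expanded are arbitrary.

```bash
#!/usr/bin/env bash
before=$IFS
IFS=', '
literal=(Paris, France, Europe)    # parsed from source text: the commas stay
string='Paris, France, Europe'
expanded=($string)                 # produced by an expansion: IFS splitting applies
IFS=$before
declare -p literal expanded
```

The literal array contains "Paris," and "France," with the commas still attached, while the expanded array contains the clean words.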
Wrong answer #7
string='first line
second line
third line'
while read -r line; do lines+=("$line"); done <<<"$string"
This is one of the best solutions. Notice that we're back to using read. Didn't I say earlier that read is inappropriate because it performs two levels of splitting, when we only need one? The trick here is that you can call read in such a way that it effectively only does one level of splitting, specifically by splitting off only one field per invocation, which necessitates the cost of having to call it repeatedly in a loop. It's a bit of a sleight of hand, but it works.
But there are problems. First: When you provide at least one NAME argument to read, it automatically ignores leading and trailing whitespace in each field that is split off from the input string. This occurs whether $IFS is set to its default value or not, as described earlier in this post. Now, the OP may not care about this for his specific use-case, and in fact, it may be a desirable feature of the parsing behavior. But not everyone who wants to parse a string into fields will want this. There is a solution, however: A somewhat non-obvious usage of read is to pass zero NAME arguments. In this case, read will store the entire input line that it gets from the input stream in a variable named $REPLY, and, as a bonus, it does not strip leading and trailing whitespace from the value. This is a very robust usage of read which I've exploited frequently in my shell programming career. Here's a demonstration of the difference in behavior:
string=$' a b \n c d \n e f '; ## input string
a=(); while read -r line; do a+=("$line"); done <<<"$string"; declare -p a;
## declare -a a=([0]="a b" [1]="c d" [2]="e f") ## read trimmed surrounding whitespace
a=(); while read -r; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]=" a b " [1]=" c d " [2]=" e f ") ## no trimming
The second issue with this solution is that it does not actually address the case of a custom field separator, such as the OP's comma-space. As before, multicharacter separators are not supported, which is an unfortunate limitation of this solution. We could try to at least split on comma by specifying the separator to the -d option, but look what happens:
string='Paris, France, Europe';
a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France")
Predictably, the unaccounted surrounding whitespace got pulled into the field values, and hence this would have to be corrected subsequently through trimming operations (this could also be done directly in the while-loop). But there's another obvious error: Europe is missing! What happened to it? The answer is that read returns a failing return code if it hits end-of-file (in this case we can call it end-of-string) without encountering a final field terminator on the final field. This causes the while-loop to break prematurely and we lose the final field.
Technically this same error afflicted the previous examples as well; the difference there is that the field separator was taken to be LF, which is the default when you don't specify the -d option, and the <<< ("here-string") mechanism automatically appends a LF to the string just before it feeds it as input to the command. Hence, in those cases, we sort of accidentally solved the problem of a dropped final field by unwittingly appending an additional dummy terminator to the input. Let's call this solution the "dummy-terminator" solution. We can apply the dummy-terminator solution manually for any custom delimiter by concatenating it against the input string ourselves when instantiating it in the here-string:
a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")
There, problem solved. Another solution is to only break the while-loop if both (1) read returned failure and (2) $REPLY is empty, meaning read was not able to read any characters prior to hitting end-of-file. Demo:
a=(); while read -rd,|| [[ -n "$REPLY" ]]; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')
This approach also reveals the secretive LF that automatically gets appended to the here-string by the <<< redirection operator. It could of course be stripped off separately through an explicit trimming operation as described a moment ago, but obviously the manual dummy-terminator approach solves it directly, so we could just go with that. The manual dummy-terminator solution is actually quite convenient in that it solves both of these two problems (the dropped-final-field problem and the appended-LF problem) in one go.
So, overall, this is quite a powerful solution. Its only remaining weakness is a lack of support for multicharacter delimiters, which I will address later.
Wrong answer #8
string='first line
second line
third line'
readarray -t lines <<<"$string"
(This is actually from the same post as #7; the answerer provided two solutions in the same post.)
The readarray builtin, which is a synonym for mapfile, is ideal. It's a builtin command which parses a bytestream into an array variable in one shot; no messing with loops, conditionals, substitutions, or anything else. And it doesn't surreptitiously strip any whitespace from the input string. And (if -O is not given) it conveniently clears the target array before assigning to it. But it's still not perfect, hence my criticism of it as a "wrong answer".
First, just to get this out of the way, note that, just like the behavior of read when doing field-parsing, readarray drops the trailing field if it is empty. Again, this is probably not a concern for the OP, but it could be for some use-cases. I'll come back to this in a moment.
Second, as before, it does not support multicharacter delimiters. I'll give a fix for this in a moment as well.
Third, the solution as written does not parse the OP's input string, and in fact, it cannot be used as-is to parse it. I'll expand on this momentarily as well.
For the above reasons, I still consider this to be a "wrong answer" to the OP's question. Below I'll give what I consider to be the right answer.
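The first point above, about the trailing field, can be seen with a sketch that feeds readarray from printf rather than a here-string, so no LF is appended behind our back:

```bash
#!/usr/bin/env bash
readarray -t two   < <(printf 'x\ny\n')     # the final LF only terminates "y"
readarray -t three < <(printf 'x\ny\n\n')   # an explicitly empty last line is kept
declare -p two three
```

The first array has two elements; the second has three, the last being empty.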
Right answer
Here's a naïve attempt to make #8 work by just specifying the -d option:
string='Paris, France, Europe';
readarray -td, a <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')
We see the result is identical to the result we got from the double-conditional approach of the looping read solution discussed in #7. We can almost solve this with the manual dummy-terminator trick:
readarray -td, a <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe" [3]=$'\n')
The problem here is that readarray preserved the trailing field, since the <<< redirection operator appended the LF to the input string, and therefore the trailing field was not empty (otherwise it would've been dropped). We can take care of this by explicitly unsetting the final array element after-the-fact:
readarray -td, a <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")
The only two problems that remain, which are actually related, are (1) the extraneous whitespace that needs to be trimmed, and (2) the lack of support for multicharacter delimiters.
The whitespace could of course be trimmed afterward (for example, see How to trim whitespace from a Bash variable?). But if we can hack a multicharacter delimiter, then that would solve both problems in one shot.
Unfortunately, there's no direct way to get a multicharacter delimiter to work. The best solution I've thought of is to preprocess the input string to replace the multicharacter delimiter with a single-character delimiter that will be guaranteed not to collide with the contents of the input string. The only character that has this guarantee is the NUL byte. This is because, in bash (though not in zsh, incidentally), variables cannot contain the NUL byte. This preprocessing step can be done inline in a process substitution. Here's how to do it using awk:
readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; }' <<<"$string, "); unset 'a[-1]';
declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
There, finally! This solution will not erroneously split fields in the middle, will not cut out prematurely, will not drop empty fields, will not corrupt itself on filename expansions, will not automatically strip leading and trailing whitespace, will not leave a stowaway LF on the end, does not require loops, and does not settle for a single-character delimiter.
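For comparison only, here is a sketch of a different technique: a pure-Bash loop that peels fields off with parameter expansion instead of going through awk and readarray. It is not the approach described above. The function name split_on and its argument order are invented for illustration, and the nameref needs Bash 4.3 or later.

```bash
#!/usr/bin/env bash
# split_on DELIM STRING ARRAYNAME  (hypothetical helper)
# Caveat: ARRAYNAME must not be "out", "rest" or "delim", which are used internally.
split_on() {
  local delim="$1" rest="$2$1"     # append a dummy terminator
  local -n out=$3
  [[ -n $delim ]] || return 1      # an empty delimiter would loop forever
  out=()
  while [[ -n $rest ]]; do
    out+=("${rest%%"$delim"*}")    # text before the first delimiter
    rest=${rest#*"$delim"}         # drop that field and its delimiter
  done
}

split_on ', ' 'Los Angeles, United States, North America' places
declare -p places
```

Because the delimiter is quoted inside the patterns, it is matched as a literal multicharacter string, so fields containing spaces stay intact.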
Trimming solution
Lastly, I wanted to demonstrate my own fairly intricate trimming solution using the obscure -C callback option of readarray. Unfortunately, I've run out of room against Stack Overflow's draconian 30,000 character post limit, so I won't be able to explain it. I'll leave that as an exercise for the reader.
function mfcb { local val="$4"; "$1"; eval "$2[$3]=\$val;"; };
function val_ltrim { if [[ "$val" =~ ^[[:space:]]+ ]]; then val="${val:${#BASH_REMATCH[0]}}"; fi; };
function val_rtrim { if [[ "$val" =~ [[:space:]]+$ ]]; then val="${val:0:${#val}-${#BASH_REMATCH[0]}}"; fi; };
function val_trim { val_ltrim; val_rtrim; };
readarray -c1 -C 'mfcb val_trim a' -td, <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")
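If the callback version is more machinery than you need, a plainer alternative (a sketch, not the method above) is to split on the comma first and then trim each element in a loop with parameter expansion. It assumes Bash 4.4 or later for readarray -d.

```bash
#!/usr/bin/env bash
string='Paris, France, Europe'
readarray -td, a <<< "$string,"; unset 'a[-1]'   # drop the stray LF element
for i in "${!a[@]}"; do
  val=${a[i]}
  val=${val#"${val%%[![:space:]]*}"}   # strip leading whitespace
  val=${val%"${val##*[![:space:]]}"}   # strip trailing whitespace
  a[i]=$val
done
declare -p a
```

This trades the single-pass callback for an explicit loop, which is easier to read and to adapt.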
Here is a way without setting IFS:
string="1:2:3:4:5"
set -f # avoid globbing (expansion of *).
array=(${string//:/ })
for i in "${!array[@]}"
do
echo "$i=>${array[i]}"
done
The idea is using string replacement:
${string//substring/replacement}
to replace all matches of substring with white space, and then using the resulting string to initialize an array:
(element1 element2 ... elementN)
Note: this answer makes use of the split+glob operator. Thus, to prevent expansion of some characters (such as *) it is a good idea to pause globbing for this script.
t="one,two,three"
a=($(echo "$t" | tr ',' '\n'))
echo "${a[2]}"
Prints three
Sometimes the method described in the accepted answer didn't work for me, especially when the separator is a newline.
In those cases I solved it this way:
string='first line
second line
third line'
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"
for line in "${lines[@]}"
do
echo "--> $line"
done
The accepted answer works for values in one line. If the variable has several lines:
string='first line
second line
third line'
We need a very different command to get all lines:
while read -r line; do lines+=("$line"); done <<<"$string"
Or the much simpler bash readarray:
readarray -t lines <<<"$string"
Printing all lines is very easy taking advantage of a printf feature:
printf ">[%s]\n" "${lines[@]}"
>[first line]
>[ second line]
>[ third line]
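The printf feature in question is that the format string is reused until all arguments are consumed. A small illustration (the array contents here are just sample data):

```shell
lines=("first line" "second line" "third line")

# One conversion, three arguments: the format is applied once per argument.
printf '>[%s]\n' "${lines[@]}"

# Two conversions per pass: arguments are consumed in pairs.
printf '%s=%s\n' a 1 b 2
```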
If you use macOS and can't use readarray, you can simply do this:
MY_STRING="string1 string2 string3"
array=($MY_STRING)
To iterate over the elements:
for element in "${array[@]}"
do
echo $element
done
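If the elements may contain spaces, the unquoted expansion above will break them apart. On the old bash 3.2 that ships with macOS, read -a is still available, so a sketch along these lines should work for a single-line string with a single-character delimiter (the variable names are only examples):

```shell
string="New York,Paris,New Jersey"

# The IFS assignment is scoped to this one read command.
IFS=',' read -r -a cities <<< "$string"

for city in "${cities[@]}"; do
    printf '%s\n' "$city"
done
```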
This works for me on OSX:
string="1 2 3 4 5"
declare -a array=($string)
If your string has a different delimiter, first replace it with spaces:
string="1,2,3,4,5"
delimiter=","
declare -a array=($(echo $string | tr "$delimiter" " "))
Simple :-)
This is similar to the approach by Jmoney38, but using sed:
string="1,2,3,4"
array=(`echo $string | sed 's/,/\n/g'`)
echo ${array[0]}
Prints 1
The key to splitting your string into an array is the multi character delimiter of ", ". Any solution using IFS for multi character delimiters is inherently wrong since IFS is a set of those characters, not a string.
If you assign IFS=", " then the string will break on EITHER "," OR " " or any combination of them which is not an accurate representation of the two character delimiter of ", ".
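To see the problem concretely, here is a small demonstration with a field that contains a space (the sample string is mine, not from the question):

```shell
string="New York, Paris, Europe"

# IFS=', ' means "split on a comma OR a space", not on the two-character string.
IFS=', ' read -r -a wrong <<< "$string"

declare -p wrong
echo "${#wrong[@]} elements"   # "New" and "York" land in separate elements
```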
You can use awk or sed to split the string, with process substitution:
#!/bin/bash
str="Paris, France, Europe"
array=()
while read -r -d $'\0' each; do # use a NUL terminated field separator
array+=("$each")
done < <(printf "%s" "$str" | awk '{ gsub(/,[ ]+|$/,"\0"); print }')
declare -p array
# declare -a array=([0]="Paris" [1]="France" [2]="Europe") output
It is more efficient to use a regex directly in Bash:
#!/bin/bash
str="Paris, France, Europe"
array=()
while [[ $str =~ ([^,]+)(,[ ]+|$) ]]; do
array+=("${BASH_REMATCH[1]}") # capture the field
i=${#BASH_REMATCH} # length of field + delimiter
str=${str:i} # advance the string by that length
done # the loop deletes $str, so make a copy if needed
declare -p array
# declare -a array=([0]="Paris" [1]="France" [2]="Europe") output...
With the second form, there is no sub shell and it will be inherently faster.
Edit by bgoldst: Here are some benchmarks comparing my readarray solution to dawg's regex solution, and I also included the read solution for the heck of it (note: I slightly modified the regex solution for greater harmony with my solution) (also see my comments below the post):
## competitors
function c_readarray { readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; };' <<<"$1, "); unset 'a[-1]'; };
function c_read { a=(); local REPLY=''; while read -r -d ''; do a+=("$REPLY"); done < <(awk '{ gsub(/, /,"\0"); print; };' <<<"$1, "); };
function c_regex { a=(); local s="$1, "; while [[ $s =~ ([^,]+),\ ]]; do a+=("${BASH_REMATCH[1]}"); s=${s:${#BASH_REMATCH}}; done; };
## helper functions
function rep {
local -i i=-1;
for ((i = 0; i<$1; ++i)); do
printf %s "$2";
done;
}; ## end rep()
function testAll {
local funcs=();
local args=();
local func='';
local -i rc=-1;
while [[ "$1" != ':' ]]; do
func="$1";
if [[ ! "$func" =~ ^[_a-zA-Z][_a-zA-Z0-9]*$ ]]; then
echo "bad function name: $func" >&2;
return 2;
fi;
funcs+=("$func");
shift;
done;
shift;
args=("$@");
for func in "${funcs[@]}"; do
echo -n "$func ";
{ time $func "${args[@]}" >/dev/null 2>&1; } 2>&1| tr '\n' '/';
rc=${PIPESTATUS[0]}; if [[ $rc -ne 0 ]]; then echo "[$rc]"; else echo; fi;
done| column -ts/;
}; ## end testAll()
function makeStringToSplit {
local -i n=$1; ## number of fields
if [[ $n -lt 0 ]]; then echo "bad field count: $n" >&2; return 2; fi;
if [[ $n -eq 0 ]]; then
echo;
elif [[ $n -eq 1 ]]; then
echo 'first field';
elif [[ "$n" -eq 2 ]]; then
echo 'first field, last field';
else
echo "first field, $(rep $[$1-2] 'mid field, ')last field";
fi;
}; ## end makeStringToSplit()
function testAll_splitIntoArray {
local -i n=$1; ## number of fields in input string
local s='';
echo "===== $n field$(if [[ $n -ne 1 ]]; then echo 's'; fi;) =====";
s="$(makeStringToSplit "$n")";
testAll c_readarray c_read c_regex : "$s";
}; ## end testAll_splitIntoArray()
## results
testAll_splitIntoArray 1;
## ===== 1 field =====
## c_readarray real 0m0.067s user 0m0.000s sys 0m0.000s
## c_read real 0m0.064s user 0m0.000s sys 0m0.000s
## c_regex real 0m0.000s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 10;
## ===== 10 fields =====
## c_readarray real 0m0.067s user 0m0.000s sys 0m0.000s
## c_read real 0m0.064s user 0m0.000s sys 0m0.000s
## c_regex real 0m0.001s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 100;
## ===== 100 fields =====
## c_readarray real 0m0.069s user 0m0.000s sys 0m0.062s
## c_read real 0m0.065s user 0m0.000s sys 0m0.046s
## c_regex real 0m0.005s user 0m0.000s sys 0m0.000s
##
testAll_splitIntoArray 1000;
## ===== 1000 fields =====
## c_readarray real 0m0.084s user 0m0.031s sys 0m0.077s
## c_read real 0m0.092s user 0m0.031s sys 0m0.046s
## c_regex real 0m0.125s user 0m0.125s sys 0m0.000s
##
testAll_splitIntoArray 10000;
## ===== 10000 fields =====
## c_readarray real 0m0.209s user 0m0.093s sys 0m0.108s
## c_read real 0m0.333s user 0m0.234s sys 0m0.109s
## c_regex real 0m9.095s user 0m9.078s sys 0m0.000s
##
testAll_splitIntoArray 100000;
## ===== 100000 fields =====
## c_readarray real 0m1.460s user 0m0.326s sys 0m1.124s
## c_read real 0m2.780s user 0m1.686s sys 0m1.092s
## c_regex real 17m38.208s user 15m16.359s sys 2m19.375s
##
Pure bash multi-character delimiter solution.
As others have pointed out in this thread, the OP's question gave an example of a comma delimited string to be parsed into an array, but did not indicate if he/she was only interested in comma delimiters, single character delimiters, or multi-character delimiters.
Since Google tends to rank this answer at or near the top of search results, I wanted to provide readers with a strong answer to the question of multiple character delimiters, since that is also mentioned in at least one response.
If you're in search of a solution to a multi-character delimiter problem, I suggest reviewing Mallikarjun M's post, in particular the response from gniourf_gniourf
who provides this elegant pure BASH solution using parameter expansion:
#!/bin/bash
str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=();
while [[ $s ]]; do
array+=( "${s%%"$delimiter"*}" );
s=${s#*"$delimiter"};
done;
declare -p array
Link to cited comment/referenced post
Link to cited question: Howto split a string on a multi-character delimiter in bash?
Update 3 Aug 2022
xebeche raised a good point in the comments below. After reviewing their suggested edits, I've revised the script provided by gniourf_gniourf and added remarks to make it easier to see what the script is doing. I also changed the double brackets [[ ]] to single brackets for greater compatibility, since many shell variants do not support double-bracket notation. In this case, for Bash, the logic works inside single or double brackets.
#!/bin/bash
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()
while [ "$str" ]; do
# take the text to the left of the next delimiter
substring="${str%%"$delimiter"*}"
# an empty substring means the string starts with a delimiter:
# strip that leading delimiter and try again (this also skips empty fields)
[ -z "$substring" ] && str="${str#"$delimiter"}" && continue
# create next array element with parsed substring
array+=( "$substring" )
# what remains to the right of the substring becomes the next string to be evaluated
str="${str:${#substring}}"
# stop once only a trailing delimiter is left
[ "$str" == "$delimiter" ] && break
done
declare -p array
Without the comments:
#!/bin/bash
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()
while [ "$str" ]; do
substring="${str%%"$delimiter"*}"
[ -z "$substring" ] && str="${str#"$delimiter"}" && continue
array+=( "$substring" )
str="${str:${#substring}}"
[ "$str" == "$delimiter" ] && break
done
declare -p array
I was curious about the relative performance of the "Right answer"
in the popular answer by @bgoldst, with its apparent decrying of loops,
so I have done a simple benchmark of it against three pure bash implementations.
In summary, I suggest:
for string length < 4k or so, pure bash is faster than gawk
for delimiter length < 10 and string length < 256k, pure bash is comparable to gawk
for delimiter length >> 10 and string length < 64k or so, pure bash is "acceptable";
and gawk is less than 5x faster
for string length < 512k or so, gawk is "acceptable"
I arbitrarily define "acceptable" as "takes < 0.5s to split the string".
I am taking the problem to be to take a bash string and split it into a bash array, using an arbitrary-length delimiter string (not regex).
# in: $1=delim, $2=string
# out: sets array a
My pure bash implementations are:
# naive approach - slow
split_byStr_bash_naive(){
a=()
local prev=""
local cdr="$2"
[[ -z "${cdr}" ]] && a+=("")
while [[ "$cdr" != "$prev" ]]; do
prev="$cdr"
a+=( "${cdr%%"$1"*}" )
cdr="${cdr#*"$1"}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use lengths wherever possible - faster
split_byStr_bash_faster(){
a=()
local car=""
local cdr="$2"
while
car="${cdr%%"$1"*}"
a+=("$car")
cdr="${cdr:${#car}}"
(( ${#cdr} ))
do
cdr="${cdr:${#1}}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use pattern substitution and readarray - fastest
split_byStr_bash_sub(){
a=()
local delim="$1" string="$2"
delim="${delim//=/=-}"
delim="${delim//$'\n'/=n}"
string="${string//=/=-}"
string="${string//$'\n'/=n}"
readarray -td $'\n' a <<<"${string//"$delim"/$'\n'}"
local len=${#a[@]} i s
for (( i=0; i<len; i++ )); do
s="${a[$i]//=n/$'\n'}"
a[$i]="${s//=-/=}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
The initial -z test in the naive version handles the case of a zero-length
string being passed. Without the test, the output array is empty;
with it, the array has a single zero-length element.
Replacing readarray with while read gives < 10% slowdown.
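The escaping in the pattern-substitution version may not be obvious at first glance: every = becomes =- and every newline becomes =n, so the newline is free to act as a temporary field separator, and the two steps are undone in reverse order afterwards. A stand-alone sketch of just that round trip, with hypothetical helper names esc and unesc:

```shell
# Hypothetical helpers showing only the escape/unescape step.
esc() {
    local s=$1
    s=${s//=/=-}          # every literal '=' becomes '=-'
    s=${s//$'\n'/=n}      # every newline becomes '=n'
    printf '%s' "$s"
}
unesc() {
    local s=$1
    s=${s//=n/$'\n'}      # '=n' back to a newline first
    s=${s//=-/=}          # then '=-' back to '='
    printf '%s' "$s"
}

original=$'a=b\nc=n'
escaped=$(esc "$original")      # contains no literal newline any more
restored=$(unesc "$escaped")
printf '%s\n' "$escaped"
```

After escaping, every = is followed by either - or n, which is why the reverse mapping is unambiguous.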
This is the gawk implementation I used:
split_byRE_gawk(){
readarray -td '' a < <(awk '{gsub(/'"$1"'/,"\0")}1' <<<"$2$1")
unset 'a[-1]'
# echo $( declare -p a | md5sum; declare -p a )
}
Obviously, in the general case, the delim argument will need to be sanitised,
as gawk expects a regex, and gawk-special characters could cause problems.
Also, as-is, the implementation won't correctly handle newlines in the delimiter.
Since gawk is being used, a generalised version that handles more arbitrary
delimiters could be:
split_byREorStr_gawk(){
local delim=$1
local string=$2
local useRegex=${3:+1} # if set, delimiter is regex
readarray -td '' a < <(
export delim
gawk -v re="$useRegex" '
BEGIN {
RS = FS = "\0"
ORS = ""
d = ENVIRON["delim"]
# cf. https://stackoverflow.com/a/37039138
if (!re) gsub(/[\\.^$(){}\[\]|*+?]/,"\\\\&",d)
}
gsub(d"|\n$","\0")
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
or the same idea in Perl:
split_byREorStr_perl(){
local delim=$1
local string=$2
local regex=$3 # if set, delimiter is regex
readarray -td '' a < <(
export delim regex
perl -0777pe '
$d = $ENV{delim};
$d = "\Q$d\E" if ! $ENV{regex};
s/$d|\n$/\0/g;
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
The implementations produce identical output, tested by comparing md5sum separately.
Note that if input had been ambiguous ("logically incorrect" as @bgoldst puts it),
behaviour would diverge slightly. For example, with delimiter -- and string a- or a---:
@bgoldst's code returns: declare -a a=([0]="a") or declare -a a=([0]="a" [1]="")
mine return: declare -a a=([0]="a-") or declare -a a=([0]="a" [1]="-")
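The pure bash behaviour on those two inputs can be reproduced with a compact splitter built on the same length-based idea as the "faster" version. This is a sketch rather than the exact benchmarked code, and the name split_by_str is hypothetical:

```shell
# Hypothetical helper: split "$2" on the literal delimiter "$1"
# into the global array "a".
split_by_str() {
    local delim=$1 rest=$2 field
    a=()
    while :; do
        field=${rest%%"$delim"*}      # text before the first delimiter
        a+=("$field")
        rest=${rest:${#field}}        # drop the field just stored
        [[ -n $rest ]] || break       # nothing left, so no delimiter followed
        rest=${rest:${#delim}}        # drop the delimiter itself
    done
}

split_by_str "--" "a-";   declare -p a
split_by_str "--" "a---"; declare -p a
```

The first delimiter match always wins, so the leftover - stays with (or becomes) a field instead of being dropped.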
Arguments were derived with simple Perl scripts from:
delim="-=-="
base="ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"
Here are the tables of timing results (in seconds) for 3 different types
of string and delimiter argument.
#s - length of string argument
#d - length of delim argument
= - performance break-even point
! - "acceptable" performance limit (bash) is somewhere around here
!! - "acceptable" performance limit (gawk) is somewhere around here
- - function took too long
<!> - gawk command failed to run
Type 1
d=$(perl -e "print( '$delim' x (7*2**$n) )")
s=$(perl -e "print( '$delim' x (7*2**$n) . '$base' x (7*2**$n) )")
n    #s        #d       gawk      b_sub     b_faster  b_naive
0    252       28       0.002     0.000     0.000     0.000
1    504       56       0.005     0.000     0.000     0.001
2    1008      112      0.005     0.001     0.000     0.003
3    2016      224      0.006     0.001     0.000     0.009
4    4032      448      0.007     0.002     0.001     0.048
=
5    8064      896      0.014     0.008     0.005     0.377
6    16128     1792     0.018     0.029     0.017     (2.214)
7    32256     3584     0.033     0.057     0.039     (15.16)
!
8    64512     7168     0.063     0.214     0.128     -
9    129024    14336    0.111     (0.826)   (0.602)   -
10   258048    28672    0.214     (3.383)   (2.652)   -
!!
11   516096    57344    0.430     (13.46)   (11.00)   -
12   1032192   114688   (0.834)   (58.38)   -         -
13   2064384   229376   <!>       (228.9)   -         -
Type 2
d=$(perl -e "print( '$delim' x ($n) )")
s=$(perl -e "print( ('$delim' x ($n) . '$base' x $n ) x (2**($n-1)) )")
n    #s        #d       gawk      b_sub     b_faster  b_naive
0    0         0        0.003     0.000     0.000     0.000
1    36        4        0.003     0.000     0.000     0.000
2    144       8        0.005     0.000     0.000     0.000
3    432       12       0.005     0.000     0.000     0.000
4    1152      16       0.005     0.001     0.001     0.002
5    2880      20       0.005     0.001     0.002     0.003
6    6912      24       0.006     0.003     0.009     0.014
=
7    16128     28       0.012     0.012     0.037     0.044
8    36864     32       0.023     0.044     0.167     0.187
!
9    82944     36       0.049     0.192     (0.753)   (0.840)
10   184320    40       0.097     (0.925)   (3.682)   (4.016)
11   405504    44       0.204     (4.709)   (18.00)   (19.58)
!!
12   884736    48       0.444     (22.17)   -         -
13   1916928   52       (1.019)   (102.4)   -         -
Type 3
d=$(perl -e "print( '$delim' x (2**($n-1)) )")
s=$(perl -e "print( ('$delim' x (2**($n-1)) . '$base' x (2**($n-1)) ) x ($n) )")
n    #s        #d       gawk      b_sub     b_faster  b_naive
0    0         0        0.000     0.000     0.000     0.000
1    36        4        0.004     0.000     0.000     0.000
2    144       8        0.003     0.000     0.000     0.000
3    432       16       0.003     0.000     0.000     0.000
4    1152      32       0.005     0.001     0.001     0.002
5    2880      64       0.005     0.002     0.001     0.003
6    6912      128      0.006     0.003     0.003     0.014
=
7    16128     256      0.012     0.011     0.010     0.077
8    36864     512      0.023     0.046     0.046     (0.513)
!
9    82944     1024     0.049     0.195     0.197     (3.850)
10   184320    2048     0.103     (0.951)   (1.061)   (31.84)
11   405504    4096     0.222     (4.796)   -         -
!!
12   884736    8192     0.473     (22.88)   -         -
13   1916928   16384    (1.126)   (105.4)   -         -
Summary of delimiters length 1..10
As short delimiters are probably more likely than long,
summarised below are the results of varying delimiter length
between 1 and 10 (results for 2..9 mostly elided as very similar).
s1=$(perl -e "print( '$d' . '$base' x (7*2**$n) )")
s2=$(perl -e "print( ('$d' . '$base' x $n ) x (2**($n-1)) )")
s3=$(perl -e "print( ('$d' . '$base' x (2**($n-1)) ) x ($n) )")
bash_sub < gawk
string  n    #s       #d   gawk    b_sub   b_faster  b_naive
s1      10   229377   1    0.131   0.089   1.709     -
s1      10   229386   10   0.142   0.095   1.907     -
s2      8    32896    1    0.022   0.007   0.148     0.168
s2      8    34048    10   0.021   0.021   0.163     0.179
s3      12   786444   1    0.436   0.468   -         -
s3      12   786456   2    0.434   0.317   -         -
s3      12   786552   10   0.438   0.333   -         -
bash_sub < 0.5s
string  n    #s       #d   gawk    b_sub   b_faster  b_naive
s1      11   458753   1    0.256   0.332   (7.089)   -
s1      11   458762   10   0.269   0.387   (8.003)   -
s2      11   361472   1    0.205   0.283   (14.54)   -
s2      11   363520   3    0.207   0.462   (16.66)   -
s3      12   786444   1    0.436   0.468   -         -
s3      12   786456   2    0.434   0.317   -         -
s3      12   786552   10   0.438   0.333   -         -
gawk < 0.5s
string  n    #s       #d   gawk    b_sub     b_faster  b_naive
s1      11   458753   1    0.256   0.332     (7.089)   -
s1      11   458762   10   0.269   0.387     (8.003)   -
s2      12   788480   1    0.440   (1.252)   -         -
s2      12   806912   10   0.449   (4.968)   -         -
s3      12   786444   1    0.436   0.468     -         -
s3      12   786456   2    0.434   0.317     -         -
s3      12   786552   10   0.438   0.333     -         -
(I'm not entirely sure why bash_sub with s>160k and d=1 was consistently slower than d>1 for s3.)
All tests carried out with bash 5.0.17 on an Intel i7-7500U running xubuntu 20.04.
Try this
IFS=', '; array=(Paris, France, Europe)
for item in ${array[@]}; do echo $item; done
It's simple. If you want, you can also add a declare (and also remove the commas):
IFS=' ';declare -a array=(Paris France Europe)
The IFS is added to undo the above but it works without it in a fresh bash instance
#!/bin/bash
string="a | b c"
pattern=' | '
# replaces pattern with newlines
splitted="$(sed "s/$pattern/\n/g" <<< "$string")"
# Reads lines and put them in array
readarray -t array2 <<< "$splitted"
# Prints number of elements
echo ${#array2[@]}
# Prints all elements
for a in "${array2[@]}"; do
echo "> '$a'"
done
This solution works for longer delimiters (more than one character).
It doesn't work if the original string already contains a newline.
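One more thing to watch for: sed treats the pattern as a regular expression, so delimiters containing characters like . or * would need escaping. A possible variant, offered as a sketch, does the replacement with bash parameter expansion instead, where a quoted pattern is taken literally. The nl variable is just a convenience I introduced, and the newline limitation stays the same:

```shell
string="a | b c"
pattern=' | '
nl=$'\n'

# Quoting "$pattern" makes bash match it as a literal string, not a glob.
readarray -t parts <<< "${string//"$pattern"/$nl}"

printf "> '%s'\n" "${parts[@]}"
```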
This works for the given data:
$ aaa='Paris, France, Europe'
$ mapfile -td ',' aaaa < <(echo -n "${aaa//, /,}")
$ declare -p aaaa
Result:
declare -a aaaa=([0]="Paris" [1]="France" [2]="Europe")
And it will also work for extended data with spaces, such as "New York":
$ aaa="New York, Paris, New Jersey, Hampshire"
$ mapfile -td ',' aaaa < <(echo -n "${aaa//, /,}")
$ declare -p aaaa
Result:
declare -a aaaa=([0]="New York" [1]="Paris" [2]="New Jersey" [3]="Hampshire")
Another way to do it without modifying IFS:
read -r -a myarray <<< "${string//, /$IFS}"
Rather than changing IFS to match our desired delimiter, we can replace all occurrences of our desired delimiter ", " with contents of $IFS via "${string//, /$IFS}".
Maybe this will be slow for very large strings though?
This is based on Dennis Williamson's answer.
I came across this post when looking to parse an input like:
word1,word2,...
None of the above helped me, so I solved it using awk. In case it helps someone:
STRING="value1,value2,value3"
array=`echo $STRING | awk -F ',' '{ s = $1; for (i = 2; i <= NF; i++) s = s "\n"$i; print s; }'`
for word in ${array}
do
echo "This is the word $word"
done
UPDATE: Don't do this, due to problems with eval.
With slightly less ceremony:
IFS=', ' eval 'array=($string)'
e.g.
string="foo, bar,baz"
IFS=', ' eval 'array=($string)'
echo ${array[1]} # -> bar
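If the point of the eval was only to keep the IFS change from leaking, a function with a local IFS gets the same effect without eval. A minimal sketch, assuming bash 4.4 or later for "local -"; the function name split_words is hypothetical:

```shell
# Hypothetical helper: word-split "$1" on commas and spaces into the
# global array "array" without touching the caller's IFS.
split_words() {
    local IFS=', '   # restored automatically when the function returns
    local -          # bash 4.4+: shell options set below are restored too
    set -f           # no filename expansion on the unquoted $1
    array=($1)
}

string="foo, bar,baz"
split_words "$string"
echo "${array[1]}"   # -> bar
```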
Do not change IFS!
Here's a simple bash one-liner:
read -a my_array <<< $(echo ${INPUT_STRING} | tr -d ' ' | tr ',' ' ')
Here's my hack!
Splitting strings by strings is a pretty boring thing to do using bash. What happens is that we have limited approaches that only work in a few cases (split by ";", "/", "." and so on) or we have a variety of side effects in the outputs.
The approach below has required a number of maneuvers, but I believe it will work for most of our needs!
#!/bin/bash
# --------------------------------------
# SPLIT FUNCTION
# ----------------
F_SPLIT_R=()
f_split() {
: 'It does a "split" into a given string and returns an array.
Args:
TARGET_P (str): Target string to "split".
DELIMITER_P (Optional[str]): Delimiter used to "split". If not
informed the split will be done by spaces.
Returns:
F_SPLIT_R (array): Array with the provided string separated by the
informed delimiter.
'
F_SPLIT_R=()
TARGET_P=$1
DELIMITER_P=$2
if [ -z "$DELIMITER_P" ] ; then
DELIMITER_P=" "
fi
REMOVE_N=1
if [ "$DELIMITER_P" == "\n" ] ; then
REMOVE_N=0
fi
# NOTE: This was the only parameter that has been a problem so far!
# By Questor
# [Ref.: https://unix.stackexchange.com/a/390732/61742]
if [ "$DELIMITER_P" == "./" ] ; then
DELIMITER_P="[.]/"
fi
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: Due to bash limitations we have some problems getting the
# output of a split by awk inside an array and so we need to use
# "line break" (\n) to succeed. Seen this, we remove the line breaks
# momentarily afterwards we reintegrate them. The problem is that if
# there is a line break in the "string" informed, this line break will
# be lost, that is, it is erroneously removed in the output!
# By Questor
TARGET_P=$(awk 'BEGIN {RS="dn"} {gsub("\n", "3F2C417D448C46918289218B7337FCAF"); printf $0}' <<< "${TARGET_P}")
fi
# NOTE: The replace of "\n" by "3F2C417D448C46918289218B7337FCAF" results
# in more occurrences of "3F2C417D448C46918289218B7337FCAF" than the
# amount of "\n" that there was originally in the string (one more
# occurrence at the end of the string)! We can not explain the reason for
# this side effect. The line below corrects this problem! By Questor
TARGET_P=${TARGET_P%????????????????????????????????}
SPLIT_NOW=$(awk -F"$DELIMITER_P" '{for(i=1; i<=NF; i++){printf "%s\n", $i}}' <<< "${TARGET_P}")
while IFS= read -r LINE_NOW ; do
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: We use "'" to prevent blank lines with no other characters
# in the sequence being erroneously removed! We do not know the
# reason for this side effect! By Questor
LN_NOW_WITH_N=$(awk 'BEGIN {RS="dn"} {gsub("3F2C417D448C46918289218B7337FCAF", "\n"); printf $0}' <<< "'${LINE_NOW}'")
# NOTE: We use the commands below to revert the intervention made
# immediately above! By Questor
LN_NOW_WITH_N=${LN_NOW_WITH_N%?}
LN_NOW_WITH_N=${LN_NOW_WITH_N#?}
F_SPLIT_R+=("$LN_NOW_WITH_N")
else
F_SPLIT_R+=("$LINE_NOW")
fi
done <<< "$SPLIT_NOW"
}
# --------------------------------------
# HOW TO USE
# ----------------
STRING_TO_SPLIT="
* How do I list all databases and tables using psql?
\"
sudo -u postgres /usr/pgsql-9.4/bin/psql -c \"\l\"
sudo -u postgres /usr/pgsql-9.4/bin/psql <DB_NAME> -c \"\dt\"
\"
\"
\list or \l: list all databases
\dt: list all tables in the current database
\"
[Ref.: https://dba.stackexchange.com/questions/1285/how-do-i-list-all-databases-and-tables-using-psql]
"
f_split "$STRING_TO_SPLIT" "bin/psql -c"
# --------------------------------------
# OUTPUT AND TEST
# ----------------
ARR_LENGTH=${#F_SPLIT_R[*]}
for (( i=0; i<=$(( $ARR_LENGTH -1 )); i++ )) ; do
echo " > -----------------------------------------"
echo "${F_SPLIT_R[$i]}"
echo " < -----------------------------------------"
done
if [ "$STRING_TO_SPLIT" == "${F_SPLIT_R[0]}bin/psql -c${F_SPLIT_R[1]}" ] ; then
echo " > -----------------------------------------"
echo "The strings are the same!"
echo " < -----------------------------------------"
fi
For multilined elements, why not something like
$ array=($(echo -e $'a a\nb b' | tr ' ' '§')) && array=("${array[@]//§/ }") && echo "${array[@]/%/ INTERELEMENT}"
a a INTERELEMENT b b INTERELEMENT
Since there are so many ways to solve this, let's start by defining what we want to see in our solution.
Bash provides a builtin readarray for this purpose. Let's use it.
Avoid ugly and unnecessary tricks such as changing IFS, looping, using eval, or adding an extra element then removing it.
Find a simple, readable approach that can easily be adapted to similar problems.
The readarray command is easiest to use with newlines as the delimiter. With other delimiters it may add an extra element to the array. The cleanest approach is to first adapt our input into a form that works nicely with readarray before passing it in.
The input in this example does not have a multi-character delimiter. If we apply a little common sense, it's best understood as comma separated input for which each element may need to be trimmed. My solution is to split the input by comma into multiple lines, trim each element, and pass it all to readarray.
string=' Paris,France , All of Europe '
readarray -t foo < <(tr ',' '\n' <<< "$string" |sed 's/^ *//' |sed 's/ *$//')
# Result:
declare -p foo
# declare -a foo='([0]="Paris" [1]="France" [2]="All of Europe")'
EDIT: My solution allows inconsistent spacing around comma separators, while also allowing elements to contain spaces. Few other solutions can handle these special cases.
I also avoid approaches which seem like hacks, such as creating an extra array element and then removing it. If you don't agree it's the best answer here, please leave a comment to explain.
If you'd like to try the same approach purely in Bash and with fewer subshells, it's possible. But the result is harder to read, and this optimization is probably unnecessary.
string=' Paris,France , All of Europe '
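shopt -s extglob # the +(...) patterns below need extended globbing
foo="${string#"${string%%[![:space:]]*}"}"
foo="${foo%"${foo##*[![:space:]]}"}"
foo="${foo//+([[:space:]]),/,}"
foo="${foo//,+([[:space:]])/,}"
# turn the commas into newlines so that readarray splits on them
readarray -t foo <<< "${foo//,/$'\n'}"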
Another way would be:
string="Paris, France, Europe"
IFS=', ' arr=(${string})
Now your elements are stored in "arr" array.
To iterate through the elements:
for i in ${arr[@]}; do echo $i; done
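One caveat with this form: both parts of that line are plain assignments, so IFS stays set to ', ' for the rest of the shell session. If you use it, you may want to save and restore IFS around it, roughly like this (saved_ifs is just a name I picked):

```shell
string="Paris, France, Europe"
saved_ifs=$IFS                # remember the current value

IFS=', ' arr=(${string})      # two assignments, so the new IFS persists
[[ $IFS == ', ' ]] && echo "IFS is still ', ' at this point"

IFS=$saved_ifs                # restore before running anything else
printf '%s\n' "${arr[@]}"
```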
Another approach can be:
str="a, b, c, d" # assuming there is a space after ',' as in Q
arr=(${str//,/}) # delete all occurrences of ','
After this 'arr' is an array with four strings.
This doesn't require dealing with IFS, read, or any other special machinery, so it is much simpler and more direct.

Resources