How to Convert String in a Text File into Array in Bash - arrays

I have a text file containing pieces of text inside single quotes. They are not one per line, e.g. there may be two or more quoted texts on the same line, but every one of them is inside single quotes, like
'hello world' 'this is the second quotes' 'and this is the third quoted text'
How can I create an array in which each quoted text is one element? I've tried using
declare -a arr=($(cat file.txt))
but that splits on whitespace, so each word becomes a separate element of the array.

If you have bash v4.4 or later, you can use xargs to parse the quoted strings and turn them into null-delimited strings, then readarray to turn that into a bash array:
readarray -t -d '' arr < <(xargs printf '%s\0' <file.txt)
If you have an older version of bash, you'll have to create the array element-by-element:
arr=( )
while IFS= read -r -d '' quote; do
arr+=( "$quote" )
done < <(xargs printf '%s\0' <file.txt)
Note that xargs quote syntax is a little different from everything else (of course). It allows both single- and double-quoted strings, but doesn't allow escaped quotes within those strings. And it probably varies a bit between versions of xargs.
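A quick way to try the loop above without touching a real file is sketched below. The temporary file and its two sample lines are assumptions made for illustration, and the sketch relies on the external printf that xargs runs accepting \0 in its format string, as the answer does.

```bash
#!/usr/bin/env bash
# Illustrative input only: a throwaway file with several quoted strings per line.
tmp=$(mktemp)
printf '%s\n' \
    "'hello world' 'this is the second quotes'" \
    "'and this is the third quoted text'" > "$tmp"

arr=()
# xargs parses the quotes and passes each string to printf,
# which writes it out followed by a NUL byte.
while IFS= read -r -d '' quote; do
    arr+=( "$quote" )
done < <(xargs printf '%s\0' < "$tmp")
rm -f "$tmp"

declare -p arr   # shows one element per quoted string
```

Each quoted string should end up as its own element, regardless of how many share a line.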

Related

Bash Array not sorting correctly

I need to order these 2 arrays, I don't care about the format of the output, I only need it to be ordered in order to compare them but this doesn't seem to work, although it works with simpler text. I also tried removing the --field-separator='"'
DIG_1=("sampletext""zzz""ms=ms91608007""asdas")
DIG_2=("zzz""ms=ms91608007""sampletext""asdas")
echo "unsorted:"
echo ${DIG_1[*]}
echo ${DIG_2[*]}
IFS=$'\n' sorted=($(sort --field-separator='"' <<<"${DIG_1[*]}")); unset IFS
IFS=$'\n' sorted2=($(sort --field-separator='"' <<<"${DIG_2[*]}")); unset IFS
echo "sorted:"
echo ${sorted[*]}
echo ${sorted2[*]}
And the output I get is:
unsorted:
sampletextzzzms=ms91608007asdas
zzzms=ms91608007sampletextasdas
sorted:
sampletextzzzms=ms91608007asdas
zzzms=ms91608007sampletextasdas
How Can I fix this? I want it to be, for example:
unsorted:
sampletextzzzms=ms91608007asdas
zzzms=ms91608007sampletextasdas
sorted:
asdasms=ms91608007sampletextzzz
asdasms=ms91608007sampletextzzz
There's no reason to use an array to store one element.
Since you need to keep the double quotes, you need to make efforts to preserve them:
DIG_1='"sampletext""zzz""ms=ms91608007""asdas"'
Otherwise the double quotes will be removed by the shell: 3.5.9 Quote Removal
When you use VAR=value some_command, that variable is only set for the duration of some_command: bash puts that variable in the environment for the command, not into the shell's own catalog of variables. Subsequently unsetting the variable is not required, and unsetting the IFS variable is potentially harmful for the rest of the program.
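One caveat worth checking on your own bash: the temporary behaviour applies when a command follows the assignment. A line made up only of assignments, such as IFS=$'\n' sorted=(...), sets both variables in the current shell. The sketch below makes the difference visible; the variable names (parts, lines, after_read, after_assign) are made up for illustration.

```bash
#!/usr/bin/env bash
unset IFS

# Case 1: assignment in front of a command. IFS is changed for read only.
IFS=',' read -r -a parts <<< "a,b,c"
after_read=${IFS-unset}          # expands to "unset" if IFS is still unset

# Case 2: two assignments and no command. Both persist in the shell.
IFS=$'\n' lines=( $(printf '%s\n' "x y" "z") )
after_assign=${IFS-unset}        # now holds whatever IFS was set to

unset IFS                        # go back to default splitting
printf 'parts=%d lines=%d\n' "${#parts[@]}" "${#lines[@]}"
```

So whether a later unset IFS (or restoring a saved value) is needed depends on which of the two forms was used.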
sort won't sort the fields within a record, it's for sorting records against each other.
To accomplish what you want, this will do:
sorted_1=$(grep -Po '(?<=").*?(?=")' <<<"$DIG_1" | sort | paste -s -d "")
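If grep -P is not available (it is a GNU extension), the same idea can be sketched with tr. This is an alternative under the assumption that the fields never contain a double quote or a newline themselves; LC_ALL=C is added only to make the sort order predictable.

```bash
#!/usr/bin/env bash
DIG_1='"sampletext""zzz""ms=ms91608007""asdas"'

# Turn every double quote into a newline, drop the empty lines that result,
# sort the remaining fields, then glue them back together without a separator.
sorted_1=$(tr '"' '\n' <<< "$DIG_1" | grep . | LC_ALL=C sort | tr -d '\n')

echo "$sorted_1"
```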
As anubhava mentioned in the comments, the current code is creating arrays of single values, ie:
$ DIG_1=("sampletext""zzz""ms=ms91608007""asdas")
$ typeset -p DIG_1
declare -a DIG_1=([0]="sampletextzzzms=ms91608007asdas")
$ DIG_2=("zzz""ms=ms91608007""sampletext""asdas")
$ typeset -p DIG_2
declare -a DIG_2=([0]="zzzms=ms91608007sampletextasdas")
Assuming the OP really does want an array, and that the array elements will be utilized in later code, we need a way to delimit the items of the array, and the easiest way to do this is with some white space, eg:
$ DIG_1=("sample text" "zzz" "ms=ms91608007" "asdas")
$ typeset -p DIG_1
declare -a DIG_1=([0]="sample text" [1]="zzz" [2]="ms=ms91608007" [3]="asdas")
$ DIG_2=("zzz" "ms=ms91608007" "sample text" "asdas")
$ typeset -p DIG_2
declare -a DIG_2=([0]="zzz" [1]="ms=ms91608007" [2]="sample text" [3]="asdas")
NOTE: I've added a single space to change "sampletext" to "sample text" so that we can see how a space is treated a) as part of the data vs b) as a delimiter.
NOTE: Assuming OPs code is generating the questionable array assignments (eg, DIG_1=("sampletext""zzz""ms=ms91608007""asdas")), it may make more sense to look into ways to 'fix' the array generator than to complicate the code by trying to figure out how to treat these single strings as a 4-part array definition.
Also, since the sample output (current vs desired) shows no double quotes I'm guessing this means the double quotes are not part of the actual data but rather just delimiters.
Now that we have an actual array of elements we can look at sorting the arrays and storing the results into additional (sorted) arrays, eg:
$ IFS=$'\n' sorted=($(printf "%s\n" "${DIG_1[@]}" | sort))
$ typeset -p sorted
declare -a sorted=([0]="asdas" [1]="ms=ms91608007" [2]="sample text" [3]="zzz")
$ IFS=$'\n' sorted2=($(printf "%s\n" "${DIG_2[@]}" | sort))
$ typeset -p sorted2
declare -a sorted2=([0]="asdas" [1]="ms=ms91608007" [2]="sample text" [3]="zzz")
At this point we now have 2 sets of arrays ... 1) original data (DIG_1[@] and DIG_2[@]) and 2) sorted (sorted[@] and sorted2[@]).
The OP can then slice-n-dice the data as desired, as well as print the contents of the arrays in any desired format, eg:
# print array elements on a single line with no delimiters, storing the results
# in variables for later use/comparison/display
$ printf -v srt "%s" "${sorted[@]}"
$ typeset -p srt
declare -- srt="asdasms=ms91608007sample textzzz"
$ echo "${srt}"
asdasms=ms91608007sample textzzz
$ printf -v srt2 "%s" "${sorted2[@]}"
$ typeset -p srt2
declare -- srt2="asdasms=ms91608007sample textzzz"
$ echo "${srt2}"
asdasms=ms91608007sample textzzz
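Since the stated goal is to compare the two arrays, here is a hedged sketch that sorts with mapfile (bash 4+), which avoids changing IFS at all, and then compares the results. The names joined, joined2 and same are assumptions for illustration, and elements are assumed not to contain newlines.

```bash
#!/usr/bin/env bash
DIG_1=("sample text" "zzz" "ms=ms91608007" "asdas")
DIG_2=("zzz" "ms=ms91608007" "sample text" "asdas")

# mapfile reads one element per line, so IFS is left alone.
mapfile -t sorted  < <(printf '%s\n' "${DIG_1[@]}" | LC_ALL=C sort)
mapfile -t sorted2 < <(printf '%s\n' "${DIG_2[@]}" | LC_ALL=C sort)

# Join each sorted array with newlines and compare the two strings.
printf -v joined  '%s\n' "${sorted[@]}"
printf -v joined2 '%s\n' "${sorted2[@]}"
if [[ $joined == "$joined2" ]]; then same=yes; else same=no; fi
echo "same elements: $same"
```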

Bash: set IFS to Space after specific character only?

I'm using IFS=', ' to split a string of comma-delimited text into an array. The problem is that occasionally one of the comma-delimited items contains a space following a :. The resulting array contains that item as two separate array elements. Is it possible to set IFS to only split ', ' and ignore a comma-delimited item that contains ': ' (or any other character for that matter)?
See the comma-delimited string returned from the first command below, note the second item has the :. See the MarkerNames[1] and MarkerNames[2] to see the unwanted split in the second command below.
$ exiftool -s3 -TracksMarkersName audioFile.wav
Marker1, Tempo: 120.0, Silence, Marker2, Silence.1, Marker3, Silence.2, Marker4, Silence.3, Marker5
$ IFS=', ' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
$ declare -p MarkerNames
declare -a MarkerNames='([0]="Marker1" [1]="Tempo:" [2]="120.0" [3]="Silence" [4]="Marker2" [5]="Silence.1" [6]="Marker3" [7]="Silence.2" [8]="Marker4" [9]="Silence.3" [10]="Marker5")'
IFS contains an enumeration of the characters which each can be a field separator. So ", " says "any run of spaces or commas separates my fields".
The simplest workaround I think would be to preprocess the output so you get the breaks where you want them.
IFS='~' MarkerNames=($(exiftool -s3 -TracksMarkersName audioFile.wav | sed 's/, /~/g'))
This of course requires you to find another IFS value which doesn't occur in your data. If Bash 4+ is available, maybe use a newline and readarray.
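The newline-plus-readarray variant mentioned above could look roughly like this. It is a sketch that uses a literal sample string in place of the exiftool output and assumes no marker name contains a newline; the names and nl variables are illustrative.

```bash
#!/usr/bin/env bash
# Sample text taken from the question; in practice it would come from exiftool.
names='Marker1, Tempo: 120.0, Silence, Marker2'
nl=$'\n'

# Replace each ", " with a newline, then read one element per line.
readarray -t MarkerNames <<< "${names//, /$nl}"
declare -p MarkerNames
```

Because only the two-character sequence ", " is replaced, the space after the colon in "Tempo: 120.0" stays inside its element.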
You could split on commas and remove the leading / trailing spaces afterwards:
IFS=',' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
shopt -s extglob # Needed for extended glob
MarkerNames=( "${MarkerNames[@]/#*( )}" ) # Remove leading spaces
MarkerNames=( "${MarkerNames[@]/%*( )}" ) # Remove trailing spaces

Bash Convert text string into array with multiple \r\n as field separator

I have a windows text file in the format:
line\r\n
line\r\n
line\r\n
\r\n
line\r\n
line\r\n
line\r\n
\r\n
...
I want to read this text file into an array where the field separator is \r\n\r\n. I searched for an answer, but nothing I found and tried worked. awk, for example, is too complex for me, and FS= did not work as I expected.
Commands to read arrays in bash can (as far as I know) only use single characters as a field separator, not complete strings like \r\n\r\n.
Workaround
First replace the field separator \r\n\r\n with a single character that does not occur in the string to be split. I found \x1e (the ASCII control character »Record Separator«) to work quite well.
Then read the array using the new (one character) field separator.
The field separator will always be removed when reading something to an array. But you can append the separator to each field.
Here is a pure bash solution to read the file file into the array array:
IFS=$'\x1e'
filecontent="$(< file)"
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
array=("${array[@]/%/$'\r\n\r\n'}")
IFS=$'\x1e' sets bash's field separator which is used to split strings into arrays. Depending on your script you may want to restore the old IFS afterwards (default is IFS=$' \t\n').
Results
For file
A B C\r\n
D E F\r\n
\r\n
G H I\r\n
\r\n
the resulting array will have two entries:
${array[0]}
A B C\r\n
D E F\r\n
\r\n
${array[1]}
G H I\r\n
\r\n
Known Problems
IFS at the beginning and end of the string will be trimmed. Repeated IFS will be squeezed. The file \r\n\r\n will result in an array without entries. Empty entries cannot be created.
\r\n\r\n is appended to all entries in all cases. The file A\r\n\r\nB will result in an array with the two entries A\r\n\r\n and B\r\n\r\n.
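The workaround above can be tried on an in-memory string, as sketched here. The sample content stands in for "$(< file)", the step that re-appends the separator is left out, and set -f is an addition (not in the answer) to keep the unquoted expansion from being glob-expanded. The names sep, blank and old_ifs are illustrative.

```bash
#!/usr/bin/env bash
# Stand-in for "$(< file)": two blocks, each followed by an empty CRLF line.
filecontent=$'A B C\r\nD E F\r\n\r\nG H I\r\n\r\n'

sep=$'\x1e'            # ASCII Record Separator, assumed absent from the data
blank=$'\r\n\r\n'

old_ifs=$IFS
set -f                 # no pathname expansion on the unquoted expansion below
IFS=$sep
array=( ${filecontent//"$blank"/$sep} )
IFS=$old_ifs           # restore the previous field separator
set +f

echo "${#array[@]} entries"
```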
In Linux all lines of files are terminated with \n.
So your problem is not the \r\n , it is just the \r. So just remove it:
$ tr -d '\r' <file >newfile
To verify that \r is removed you can do:
$ head -n2 newfile |od -t x1c
This will get the first two lines of the new file and the od tool will dump / convert those lines in ascii hex codes. In ascii hex \r is \x0d and \n is \x0a.
Once you have removed the \r from your file you can do anything you want.
You can use all linux tools (including awk) straight forward without special settings.
To build an array you can use:
$ while read -r line;do data+=("$line");done <newfile
If you want to skip blank lines , this one is enough:
$ while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done <file1
You can of course combine array creation with removal of the \r on the fly, without modifying your existing file, like this:
while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done < <(tr -d '\r' <file1)
To see what is inside array "data" just use $ declare -p data
PS: By the way, using awk -v RS="\r\n" '{your awk code here}' should be enough to read the initial file in awk as well. RS is the record (line) separator.
I made this script in pure bash, even if the answer from socowi is pure bash too:
exec < filern.txt
declare -a array
acc=""
lineno=0
cr=$(echo -en "\r")
while read line; do
line=${line%$cr}
if [ -z "$line" ]; then
let lineno=$lineno+1
array[$lineno]=$acc
acc=""
else
[ ! -z "$acc" ] && acc="$acc--" # you can use any separator here
acc="$acc$line"
fi
done
echo "Read file in array:"
for ((i=1; i<= ${#array[@]}; i++)) do
printf "%3.3d |%s|\n" $i "${array[$i]}"
done
It reads a "real" line of input at a time, and strips the trailing \r.
At this point, a sequence \r\n\r\n turns into an empty line, so that is used to assign the array elements one after the other.
The output from the example file is:
Read file in array:
001 |line--line--line|
002 |line--line--line|
The separator could also be a \r, or whatever. I couldn't find a way to clear the trailing \r with the command line=${line% ?? }, so I used a variable. The same trick can be used to add a "strange" separator to the variable acc. I hope it helps.
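For what it's worth, bash's $'...' quoting can express the carriage return directly inside the expansion, so the helper variable is optional. The sketch below applies it to a made-up in-memory sample rather than filern.txt; the blocks array and the input string are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Made-up sample: two blocks, each terminated by an empty CRLF line.
input=$'a\r\nb\r\n\r\nc\r\n\r\n'

blocks=()
acc=""
while IFS= read -r line; do
    line=${line%$'\r'}                 # strip one trailing carriage return
    if [[ -z $line ]]; then
        blocks+=( "$acc" )             # an empty line closes the current block
        acc=""
    else
        acc+=${acc:+--}$line           # join lines of a block with "--"
    fi
done < <(printf '%s' "$input")

declare -p blocks
```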

Return awk split array to a bash variable

I have a requirement to split a string on a multi-character delimiter and return the values into an array in Bash for further processing
IFS can take a single character delimiter.
a="2;AAAAA;BBBBB;1111_MultiCharDel_2;CCCC;DDDDDD;22222_MultiCharDel_2;EEEE;FFFFFFF;22222"
awk '{split($0,ArrayDeltaMulDep,"_MultiCharDel_")}' <<< $a
The input string can have several substrings separated by the MultiCharDel delimiter.
How can I access this array ArrayDeltaMulDep for further processing in Bash?
Your example string, a, does not contain newlines. If that is true in general, then:
a="2;AAAAA;BBBBB;1111_MultiCharDel_2;CCCC;DDDDDD;22222"
readarray -t b <<< "${a//MultiCharDel/$'\n'}"
We can verify that this split the string properly using declare -p to show the value of b:
$ declare -p b
declare -a b=([0]="2;AAAAA;BBBBB;1111_" [1]="_2;CCCC;DDDDDD;22222")
How it works:
readarray -t b
This reads lines from stdin and puts them in a bash array b.
<<< "${a//MultiCharDel/$'\n'}"
${a//MultiCharDel/$'\n'} uses pattern substitution to replace MultiCharDel with a newline character. <<< provides the result as stdin to the command readarray.
Hat tip: Chepner
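The example above replaces only MultiCharDel, so the surrounding underscores stay in the data. A variant that replaces the whole _MultiCharDel_ delimiter from the question is sketched here, using the question's three-part string; the nl variable is just a convenience holding a newline.

```bash
#!/usr/bin/env bash
a="2;AAAAA;BBBBB;1111_MultiCharDel_2;CCCC;DDDDDD;22222_MultiCharDel_2;EEEE;FFFFFFF;22222"
nl=$'\n'

# Replace every occurrence of the full delimiter with a newline,
# then let readarray store one line per array element.
readarray -t b <<< "${a//_MultiCharDel_/$nl}"
declare -p b
```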
More general solution
A bash string will never contain a null character (hex 00). Using GNU sed:
b=()
while read -d '' -r line
do
b+=("$line")
done < <(sed 's/MultiCharDel/\x00/g; s/$/\x00/' <<<"$a")
This again creates an array with the desired splitting:
$ declare -p b
declare -a b=([0]="2;AAAAA;BBBBB;1111_" [1]="_2;CCCC;DDDDDD;22222")
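If GNU sed is not at hand, a loop built on parameter expansion also copes with embedded newlines. This is a sketch, not part of the answer above: split_on and the global array parts are hypothetical names chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split string $1 on the literal delimiter $2
# and store the pieces in the global array "parts".
split_on() {
    local s=$1 delim=$2
    parts=()
    while [[ $s == *"$delim"* ]]; do
        parts+=( "${s%%"$delim"*}" )   # text before the first delimiter
        s=${s#*"$delim"}               # drop that text and the delimiter
    done
    parts+=( "$s" )                    # whatever follows the last delimiter
}

split_on $'x_MultiCharDel_y\nz_MultiCharDel_w' '_MultiCharDel_'
declare -p parts
```

Quoting "$delim" inside the expansions makes the delimiter match literally, even if it contains glob characters.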

Bash, split words into letters and save to array

I'm struggling with a project. I am supposed to write a bash script that works like the tr command. To begin with, I would like to save each of the command's arguments into a separate array. If an argument is a word, I would like each character in a separate array field, e.g.
tr_mine AB DC
I would like to have two arrays: a[0] = A, a[1] = B and b[0]=C b[1]=D.
I found a way, but it's not working:
IFS="" read -r -a array <<< "$a"
No sed, no awk, all bash internals.
Assuming that words are always separated with blanks (space and/or tabs),
also assuming that words are given as arguments, and writing for bash only:
#!/bin/bash
blank=$'[ \t]'
varname='A'
n=1
while IFS='' read -r -d '' -N 1 c ; do
if [[ $c =~ $blank ]]; then n=$((n+1)); continue; fi
eval ${varname}${n}'+=("'"$c"'")'
done <<<"$@"
last=$(eval echo \${#${varname}${n}[@]}) ### Find last character index.
unset "${varname}${n}[$last-1]" ### Remove last (trailing) newline.
for ((j=1;j<=$n;j++)); do
k="A$j[@]"
printf '<%s> ' "${!k}"; echo
done
That will set each array A1, A2, A3, etc. ... to the letters of each word.
The value at the end of the first loop of $n is the count of words processed.
Printing may be a little tricky, that is why the code to access each letter is given above.
Applied to your sample text:
$ script.sh AB DC
<A> <B>
<D> <C>
The script is setting two (array) vars A1 and A2.
And each letter is one array element: A1[0] = A, A1[1] = B and A2[0] = D, A2[1] = C.
You need to set a variable ($k) to the array element to access.
For example, to echo fourth letter (0 based) of second word (1 based) you need to do (that may be changed if needed):
k="A2[3]"; echo "${!k}" ### Indirect addressing.
The script will work as this:
$ script.sh ABCD efghi
<A> <B> <C> <D>
<e> <f> <g> <h> <i>
Caveat: Characters will be split even if quoted. However, quoted arguments is the correct way to use this script to avoid the effect of shell metacharacters ( |,&,;,(,),<,>,space,tab ). Of course, spaces (even if repeated) will split words as defined by the variable $blank:
$ script.sh $'qwer;rttt fgf\ngfg'
<q> <w> <e> <r> <;> <r> <t> <t> <t>
<>
<>
<>
<f> <g> <f> <
> <g> <f> <g>
As the script will accept and correctly process embedded newlines, we need to use unset "${varname}${n}[$last-1]" to remove the last trailing "newline". If that is not desired, quote the line.
Security Note: The eval is not much of a problem here as it is only processing one character at a time. It would be difficult to create an attack based on just one character. Anyway, the usual warning is valid: Always sanitize your input before using this script. Also, most (not quoted) metacharacters of bash will break this script.
$ script.sh qwer(rttt fgfgfg
bash: syntax error near unexpected token `('
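When the number of words is known in advance, as in the two-argument tr_mine example from the question, substring expansion splits a word into letters without eval. The sketch below uses the variables first and second as stand-ins for "$1" and "$2"; those names are assumptions for illustration.

```bash
#!/usr/bin/env bash
first=AB     # stand-ins for "$1" and "$2" of the hypothetical tr_mine script
second=DC

a=()
b=()
# ${var:offset:length} extracts one character at a time.
for ((i = 0; i < ${#first}; i++));  do a+=( "${first:i:1}" );  done
for ((i = 0; i < ${#second}; i++)); do b+=( "${second:i:1}" ); done

declare -p a b
```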
I would strongly suggest to do this in another language if possible, it will be a lot easier.
Now, the closest I come up with is:
#!/bin/bash
sentence="AC DC"
words=`echo "$sentence" | tr " " "\n"`
# final array
declare -A result
# word count
wc=0
for i in $words; do
# letter count in the word
lc=0
for l in `echo "$i" | grep -o .`; do
result["w$wc-l$lc"]=$l
lc=$(($lc+1))
done
wc=$(($wc+1))
done
rLen=${#result[@]}
echo "Result Length $rLen"
for i in "${!result[@]}"
do
echo "$i => ${result[$i]}"
done
The above prints:
Result Length 4
w1-l1 => C
w1-l0 => D
w0-l0 => A
w0-l1 => C
Explanation:
Dynamic variables are not supported in bash (ie create variables using variables) so I am using an associative array instead (result)
Arrays in bash are single dimension. To fake a 2D array I use the indexes: w for words and l for letters. This will make further processing a pain...
Associative arrays are not ordered thus results appear in random order when printing
${!result[@]} is used instead of ${result[@]}. The first iterates keys while the second iterates values
I know this is not exactly what you ask for, but I hope it will point you to the right direction
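If a stable order is needed when printing, one option is to sort the keys first. The sketch below hard-codes the four entries from the output above; the keys array and the ordered string are illustrative names, and mapfile needs bash 4+.

```bash
#!/usr/bin/env bash
declare -A result=( [w0-l0]=A [w0-l1]=C [w1-l0]=D [w1-l1]=C )

# Collect the keys, sort them, and walk the array in that order.
mapfile -t keys < <(printf '%s\n' "${!result[@]}" | LC_ALL=C sort)

ordered=""
for k in "${keys[@]}"; do
    ordered+="$k=${result[$k]} "
done
echo "$ordered"
```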
Try this :
sentence="$@"
read -r -a words <<< "$sentence"
for word in "${words[@]}"; do
inc=$(( i++ ))
read -r -a l${inc} <<< $(sed 's/./& /g' <<< $word)
done
echo ${words[1]} # prints the second word, e.g. "CD" for the input AB CD
echo ${l1[1]} # prints its second letter, e.g. "D"
The first read reads all words, the internal one is for letters.
The sed command adds a space after each letter to make the string splittable by read -a. You can also use this sed command to remove unwanted characters from words (e.g. commas) before splitting.
If special characters are allowed in words, you can use a simple grep instead of the sed command (as suggested in http://www.unixcl.com/2009/07/split-string-to-characters-in-bash.html) :
read -r -a l${inc} <<< $(grep -o . <<< $word)
The word array is ${words[@]}.
The letters arrays are named l# where # is an increment added for each word read.
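A hedged variation on the grep approach: reading its output with mapfile (bash 4+) instead of an unquoted $(...) keeps characters such as * from being glob-expanded. The sample word and the letters array are made up for illustration.

```bash
#!/usr/bin/env bash
word='a,b*c'

# grep -o . prints every character on its own line;
# mapfile stores each line as one array element.
mapfile -t letters < <(grep -o . <<< "$word")

declare -p letters
```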
