Bash: set IFS to Space after specific character only? - arrays

I'm using IFS=', ' to split a string of comma-delimited text into an array. The problem is that occasionally one of the comma-delimited items contains a space following a :. The resulting array contains that item as two separate array elements. Is it possible to set IFS to only split ', ' and ignore a comma-delimited item that contains ': ' (or any other character for that matter)?
See the comma-delimited string returned from the first command below, note the second item has the :. See the MarkerNames[1] and MarkerNames[2] to see the unwanted split in the second command below.
$ exiftool -s3 -TracksMarkersName audioFile.wav
Marker1, Tempo: 120.0, Silence, Marker2, Silence.1, Marker3, Silence.2, Marker4, Silence.3, Marker5
$ IFS=', ' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
$ declare -p MarkerNames
declare -a MarkerNames='([0]="Marker1" [1]="Tempo:" [2]="120.0" [3]="Silence" [4]="Marker2" [5]="Silence.1" [6]="Marker3" [7]="Silence.2" [8]="Marker4" [9]="Silence.3" [10]="Marker5")'

IFS contains an enumeration of the characters which each can be a field separator. So ", " says "any run of spaces or commas separates my fields".
The simplest workaround I think would be to preprocess the output so you get the breaks where you want them.
IFS='~' MarkerNames=($(exiftool -s3 -TracksMarkersName audioFile.wav | sed 's/, /~/g'))
This of course requires you to find another IFS value which doesn't occur in your data. If Bash 4+ is available, maybe use a newline and readarray.

You could split on commas and remove the leading / trailing spaces afterwards:
IFS=',' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
shopt -s extglob # Needed for extended glob
MarkerNames=( "${MarkerNames[#]/#*( )}" ) # Remove leading spaces
MarkerNames=( "${MarkerNames[#]/%*( )}" ) # Remove trailing spaces

Related

How to Convert String in a Text File into Array in Bash

I have text file containing quotes inside single quotes. They are "not" one liners. eg there may be two quotes in same line, but all quotes are inside single quotes like
'hello world' 'this is the second quotes' 'and this is the third quoted text'
how can I create an array and making each quoted text an element of the array. I've tried using
declare -a arr=($(cat file.txt))
but that makes it space separated. assigns each word an element in the array
If you have bash v4.4 or later, you can use xargs to parse the quoted strings and turn them into null-delimited strings, then readarray to turn that into a bash array:
readarray -t -d '' arr < <(xargs printf '%s\0' <file.txt)
If you have an older version of bash, you'll have to create the array element-by-element:
arr=( )
while IFS= read -r -d '' quote; do
arr+=( "$quote" )
done < <(xargs printf '%s\0' <file.txt)
Note that xargs quote syntax is a little different from everything else (of course). It allows both single- and double-quoted strings, but doesn't allow escaped quotes within those strings. And it probably varies a bit between versions of xargs.

awk through text file with different delimiter count into array

I have a text file that has 8000 lines, here is an example
00122;IL;Chicago;Router;;1496009459
00133;IL;Chicago;Router;0;6.651;1496009460
00166;IL;Chicago;Router;0;5.798;1496009460
00177;IL;Chicago;Router;0;5.365;1496009460
00188;IL;Chicago;Router;0;22.347;1496009460
As you can see the file has different count of delimiter, I need to insert all columns separated by ';' to an array no matter when the the delimiter occurs
So the first line would have 6 fields and the second line would have 7.
When I tried do it through the below command
Number=( $(awk '{print $1}' $FileName.txt) ) with different array name and field for each columns, I am getting strange behavior which not all fields are printed for some lines when I echo them all in one line
Performance is very important (need to do it in a matter of seconds )and I found using awk is the fastest approach so far, unless someone has better approach.
An ideas why this is happening ?
To dump the entire text file into an array, I would use the following. In this example, we use the two arrays ${finalarray[]} and ${subarray[]} (though the latter is unset at the end) and the variable $line. We assume the file name is file.txt.
#!/bin/bash
finalarray=()
while read line; do #For each cycle, the variable $line is the next line of the file
if [[ -z $line ]]; then continue; done #If the line is empty, skip this cycle
IFS=";" read -r -a subarray <<< "$line" #Split $line into ${subarray[]} using : as delim
finalarray+=( "$subarray[#]}" ) #Add every element from ${subarray[]} to ${finalarray[]}
unset subarray #clears the array
done <file.txt
If your empty lines are, in fact, populated by spaces or other whitespace characters, the empty line catch won't work. Instead, you could use something like the following to skip any lines not containing semicolons.
if [[ $(echo "$line" | grep -c ";") -eq 0 ]]; then continue; fi
On the other hand, this would skip all lines without a semicolon, even if you intended some of those lines to be a single array entry.

Bash Convert text string into array with multiple \r\n as field seperator

I have a windows text file in the format:
line\r\n
line\r\n
line\r\n
r\n
line\r\n
line\r\n
line\r\n
r\n
...
I want to put this textfile into an array where the field seperator is \r\n\r\n - I did search for an answer but nothing I found and tried did work . awk for example is too complex for me and FS= did not work as I expected.
Commands to read arrays in bash can (as far as I know) only use single characters as a field separator, not complete strings like \r\n\r\n.
Workaround
First replace the field separator \r\n\r\n with a single char which is not used in the string to be splitted. I found \x1e (the ASCII control character »Record Separator«) to work out quite well.
Then read the array using the new (one character) field separator.
The field separator will always be removed when reading something to an array. But you can append the separator to each field.
Here is a pure bash solution to read the file file into the array array:
IFS=$'\x1e'
filecontent="$(< file)"
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
array=("${array[#]/%/$'\r\n\r\n'}")
IFS=$'\x1e' sets bash's field separator which is used to split strings into arrays. Depending on your script you may want to restore the old IFS afterwards (default is IFS=$' \t\n').
Results
For file
A B C\r\n
D E F\r\n
\r\n
G H I\r\n
\r\n
the resulting array will have two entries:
${array[0]}
A B C\r\n
D E F\r\n
\r\n
${array[1]}
G H I\r\n
\r\n
Known Problems
IFS at the beginning and end of the string will be trimmed. Repeated IFS will be squeezed. The file \r\n\r\n will result in an array without entries. Empty entries cannot be created.
\r\n\r\n is appended to all entries in all cases. The file A\r\n\r\nB will result an array with the two entries A\r\n\r\n and B\r\n\r\n.
In Linux all lines of files are terminated with \n.
So your problem is not the \r\n , it is just the \r. So just remove it:
$ tr -d '\r' <file >newfile
To verify that \r is removed you can do:
$ head -n2 newfile |od -t x1c
This will get the first two lines of the new file and the od tool will dump / convert those lines in ascii hex codes. In ascii hex \r is \x0d and \n is \x0a.
Once you have removed the \r from your file you can do anything you want.
You can use all linux tools (including awk) straight forward without special settings.
To built an array you can use:
$ while read -r line;do data+=("$line");done <newfile
If you want to skip blank lines , this one is enough:
$ while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done <file1
You can offcourse combine array creation with removal of the \r on-the-fly, without modifying your existed file like this ( See online testing here. )
while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done < <(tr -d '\r' <file1)
To see what is inside array "data" just use $ declare -p data
PS: By the way using awk -v RS="\r\n" '{you awk code here}' should be enough even to read the initial file in awk as well. RS = Record (lines) Separator
I made this script in pure bash, even if the answer from socowi is pure bash too:
exec < filern.txt
declare -a array
acc=""
lineno=0
cr=$(echo -en "\r")
while read line; do
line=${line%$cr}
if [ -z "$line" ]; then
let lineno=$lineno+1
array[$lineno]=$acc
acc=""
else
[ ! -z "$acc" ] && acc="$acc--" # you can use any separator here
acc="$acc$line"
fi
done
echo "Read file in array:"
for ((i=1; i<= ${#array[#]}; i++)) do
printf "%3.3d |%s|\n" $i "${array[$i]}"
done
It reads a "real" line of input at a time, and strips the trailing \r.
At this point, a sequence \r\n\r\n turns into an empty line, so that is used to assign the array elements one after the other.
The output from the example file is:
Read file in array:
001 |line--line--line|
002 |line--line--line|
The separator could also be a \r, or whatever. I coudn't find a way to clear the trailing \r with the command line=${line% ?? }, so I used a variable. The same trick can be used to add "strange" separator to the variable ACC. I hope it helps.

Remove spaces from all elements of an array

I've sample file named 'test.in' like below, with leading and trailing spaces:
Hi | hello how | are you?
I need | to remove | leading & trailing
spaces | where ever | it's located
I need to put the elements separated by "|" in array with "\n" as the main delimiter.
I want that each element in array shouldn't have leading or trailing spaces, only spaces in between characters are allowed. I'm using a sample code to test the results before I put the code is my primary deployment.
Portion of sample script:
OIFS="$IFS"
IFS=$'\n'
while read LINE
do
IFS='|'
my_tmpary=($LINE)
echo ${my_tmpary[#]}
my_ary=`echo ${my_tmpary[#]} | awk '$1=$1'`
echo ${my_ary[#]}
done < test.in
I would like NOT to use a loop to clean up the extra spaces.
I did trial & error methods using sed, awk, tr, but it's not a success for me yet.
my_ary[#] should be like this
Hi hello how are you?
I need to remove leading & trailing
spaces where ever it's located
You can (ab)use the fact that read will remove leading and trailing spaces when IFS is default:
while read -r line; do
printf "%s\n" "$line" # Leading and trailing spaces are removed.
done < test.in
Alternative you can sed for such task:
sed 's/^[[:space:]]*\|[[:space:]]*$//g' test.in
sed -E 's/[\t]{1,}/ /g;s/^ *| *$//g;s/[ ]{2,}/ /g;s/ *\| */ /g;' test.in
should do it. So :
$ my_ary=$(sed -E 's/[\t]{1,}/ /g;s/^ *| *$//g;s/[ ]{2,}/ /g;s/ *\| */ /g;' test.in)
$ echo "$my_ary"
Hi hello how are you?
I need to remove leading & trailing
spaces where ever it's located
This lines has many tabs
solves your problem.
Notes
1. s/[\t]{1,}/ /g converts the tabs ie \t to whitespaces.
2. s/^ *| *$//g removes the leading & trailing whitespaces.
3. s/[ ]{2,}/ /g squeezes multiple whitespaces to one.
4. s/ *\| */|/g removes the spaces around |
5. The -E enables the use of extended regex with sed.
In AWK:
awk '{gsub(/^ *| *$|/,"",$0); gsub(/ *\| *| +/," ",$0); print $0}' test.in
gsub(/^ *| *$|/,"",$0) remove leading and trailing space,
gsub(/ *\| *| +/," ",$0) replace space-pipe-space combos and multiple spaces with a single space,
print $0 print the whole record. $0 could be omitted like #mona_sax commented but for the sake of clarity I left it in code.
Surely it could be looped and each pipe delimited field trimmed separately:
awk -F\| -v OFS=" " ' # set input field separator to "|" and output separator to " "
function trim(str) {
gsub(/^ *| *$/,"",str); # remove leading and trailing space from the field
gsub(/ +/," ",str); # mind those multiple spaces as well
return str # return cleaned field
}
{
for(i=1;i<=NF;i++) # loop thru all pipe separated fields
printf "%s%s", trim($i),(i<NF?OFS:ORS) # print field and OFS or
} # output record separator in the end of record
' test.in
.

Bash, split words into letters and save to array

I'm struggling with a project. I am supposed to write a bash script which will work like tr command. At the beginning I would like to save all commands arguments into separated arrays. And in case if an argument is a word I would like to have each char in separated array field,eg.
tr_mine AB DC
I would like to have two arrays: a[0] = A, a[1] = B and b[0]=C b[1]=D.
I found a way, but it's not working:
IFS="" read -r -a array <<< "$a"
No sed, no awk, all bash internals.
Assuming that words are always separated with blanks (space and/or tabs),
also assuming that words are given as arguments, and writing for bash only:
#!/bin/bash
blank=$'[ \t]'
varname='A'
n=1
while IFS='' read -r -d '' -N 1 c ; do
if [[ $c =~ $blank ]]; then n=$((n+1)); continue; fi
eval ${varname}${n}'+=("'"$c"'")'
done <<<"$#"
last=$(eval echo \${#${varname}${n}[#]}) ### Find last character index.
unset "${varname}${n}[$last-1]" ### Remove last (trailing) newline.
for ((j=1;j<=$n;j++)); do
k="A$j[#]"
printf '<%s> ' "${!k}"; echo
done
That will set each array A1, A2, A3, etc. ... to the letters of each word.
The value at the end of the first loop of $n is the count of words processed.
Printing may be a little tricky, that is why the code to access each letter is given above.
Applied to your sample text:
$ script.sh AB DC
<A> <B>
<D> <C>
The script is setting two (array) vars A1 and A2.
And each letter is one array element: A1[0] = A, A1[1] = B and A2[0]=C, A2[1]=D.
You need to set a variable ($k) to the array element to access.
For example, to echo fourth letter (0 based) of second word (1 based) you need to do (that may be changed if needed):
k="A2[3]"; echo "${!k}" ### Indirect addressing.
The script will work as this:
$ script.sh ABCD efghi
<A> <B> <C> <D>
<e> <f> <g> <h> <i>
Caveat: Characters will be split even if quoted. However, quoted arguments is the correct way to use this script to avoid the effect of shell metacharacters ( |,&,;,(,),<,>,space,tab ). Of course, spaces (even if repeated) will split words as defined by the variable $blank:
$ script.sh $'qwer;rttt fgf\ngfg'
<q> <w> <e> <r> <;> <r> <t> <t> <t>
<>
<>
<>
<f> <g> <f> <
> <g> <f> <g>
As the script will accept and correctly process embebed newlines we need to use: unset "${varname}${n}[$last-1]" to remove the last trailing "newline". If that is not desired, quote the line.
Security Note: The eval is not much of a problem here as it is only processing one character at a time. It would be difficult to create an attack based on just one character. Anyway, the usual warning is valid: Always sanitize your input before using this script. Also, most (not quoted) metacharacters of bash will break this script.
$ script.sh qwer(rttt fgfgfg
bash: syntax error near unexpected token `('
I would strongly suggest to do this in another language if possible, it will be a lot easier.
Now, the closest I come up with is:
#!/bin/bash
sentence="AC DC"
words=`echo "$sentence" | tr " " "\n"`
# final array
declare -A result
# word count
wc=0
for i in $words; do
# letter count in the word
lc=0
for l in `echo "$i" | grep -o .`; do
result["w$wc-l$lc"]=$l
lc=$(($lc+1))
done
wc=$(($wc+1))
done
rLen=${#result[#]}
echo "Result Length $rLen"
for i in "${!result[#]}"
do
echo "$i => ${result[$i]}"
done
The above prints:
Result Length 4
w1-l1 => C
w1-l0 => D
w0-l0 => A
w0-l1 => C
Explanation:
Dynamic variables are not supported in bash (ie create variables using variables) so I am using an associative array instead (result)
Arrays in bash are single dimension. To fake a 2D array I use the indexes: w for words and l for letters. This will make further processing a pain...
Associative arrays are not ordered thus results appear in random order when printing
${!result[#]} is used instead of ${result[#]}. The first iterates keys while the second iterates values
I know this is not exactly what you ask for, but I hope it will point you to the right direction
Try this :
sentence="$#"
read -r -a words <<< "$sentence"
for word in ${words[#]}; do
inc=$(( i++ ))
read -r -a l${inc} <<< $(sed 's/./& /g' <<< $word)
done
echo ${words[1]} # print "CD"
echo ${l1[1]} # print "D"
The first read reads all words, the internal one is for letters.
The sed command add a space after each letters to make the string splittable by read -a. You can also use this sed command to remove unwanted characters from words (eg commas) before splitting.
If special characters are allowed in words, you can use a simple grep instead of the sed command (as suggested in http://www.unixcl.com/2009/07/split-string-to-characters-in-bash.html) :
read -r -a l${inc} <<< $(grep -o . <<< $word)
The word array is ${w}.
The letters arrays are named l# where # is an increment added for each word read.

Resources