Read delimited multiline string file into multiple arrays in Bash

I began with a file like so:
Table_name1 - Table_desc1
Table_name2 - Table_desc2
...
...
I have a script that parses this file and splits them into two arrays:
declare -a TABLE_IDS=()
declare -a TABLE_DESCS=()
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]=${line%' '-' '*}
TABLE_DESCS[i++]=${line#*' '-' '}
done < "${TABLE_LIST}"
for i in "${!TABLE_IDS[@]}"; do
echo "Creating Table ID: "${TABLE_IDS[i]}", with Table Description: "${TABLE_DESCS[i]}""
done
This works really well, with no problems whatsoever.
I wanted to extend this and make the file:
Table_name1 - Table_desc1 - Table_schema1
Table_name2 - Table_desc2 - Table_schema2
...
...
For this, I tried:
declare -a TABLE_IDS=()
declare -a TABLE_DESCS=()
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]="$(echo $line | cut -f1 -d - | tr -d ' ')"
TABLE_DESCS[i++]="$(echo $line | cut -f2 -d - | tr -d ' ')"
TABLE_SCHEMAS[i++]="$(echo $line | cut -f3 -d - | tr -d ' ')"
done < "${TABLE_LIST}"
for i in "${!TABLE_IDS[@]}"; do
echo "Creating Table ID: "${TABLE_IDS[i]}", with Table Description: "${TABLE_DESCS[i]}" and schema: "${TABLE_SCHEMAS[i]}""
done
And while this will faithfully list all the Table IDs and the Table descriptions, the schemas are omitted. I tried:
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]="$(echo $line | cut -f1 -d - | tr -d ' ')"
TABLE_DESCS[i]="$(echo $line | cut -f2 -d - | tr -d ' ')"
TABLE_SCHEMAS[i]="$(echo $line | cut -f3 -d - | tr -d ' ')"
done < "${TABLE_LIST}"
And it returns just the last line's Table name, description AND schema. I suspect this is an indexing/looping problem, but am unable to figure out what exactly is going wrong. Please help! Thanks!

Perhaps set IFS to the actual delimiter (-) and do the processing inside the read loop, instead of deferring it and using arrays.
$ while IFS=- read -r t d s;
do
echo "Creating Table ID: ${t// }, with Table Description: ${d// } and schema: ${s// }";
done < file
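If the arrays are still wanted, the same IFS approach can fill them. In the question's first attempt, i++ appears on both the TABLE_DESCS and TABLE_SCHEMAS lines, so each schema lands one index after its ID; in the second attempt, i is never incremented, so every line overwrites index 0. A minimal sketch that appends with += and so needs no counter; the temporary sample file is a hypothetical stand-in for "${TABLE_LIST}", and it assumes no field contains a hyphen:

```shell
#!/usr/bin/env bash
# Hypothetical sample input standing in for the file named by "${TABLE_LIST}".
TABLE_LIST=$(mktemp)
printf '%s\n' 'Table_name1 - Table_desc1 - Table_schema1' \
              'Table_name2 - Table_desc2 - Table_schema2' > "$TABLE_LIST"

declare -a TABLE_IDS=() TABLE_DESCS=() TABLE_SCHEMAS=()

# Split each line on "-" and append one element per array, so the
# three arrays always share the same index for the same line.
while IFS=- read -r id desc schema || [[ -n "$id" ]]; do
    TABLE_IDS+=("${id// /}")          # ${var// /} strips all spaces
    TABLE_DESCS+=("${desc// /}")
    TABLE_SCHEMAS+=("${schema// /}")
done < "$TABLE_LIST"
rm -f "$TABLE_LIST"

for i in "${!TABLE_IDS[@]}"; do
    echo "Creating Table ID: ${TABLE_IDS[i]}, with Table Description: ${TABLE_DESCS[i]} and schema: ${TABLE_SCHEMAS[i]}"
done
```

Like the one-liner above, this removes every space from each field, so it is only suitable when descriptions contain no spaces that must be kept.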

Related

diff two arrays each containing files paths into a third array (for removal)

In the function below you will see notes on several attempts to solve this problem; each attempt has a note indicating what went wrong. Between my attempts there are lines from another question here which purport to solve some element of the matter. Again, I've added a note indicating what each is supposed to solve. My brain is mush at this point. What is the stupid simple thing I'm overlooking?
function func_removeDestinationOrphans() {
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 )
for (( i = 0 ; i < ${#A_Destination_orphans[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans[${i}]}" # path to each track
done
printf '%b\n' ""
# https://stackoverflow.com/questions/2312762/compare-difference-of-two-arrays-in-bash
# echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u ## original
# Array3=(`echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u `) ## store in array
# A_Destination_orphans_diff=(`echo "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | tr ' ' '\n' | sort | uniq -u `) # drops file path after space
# printf "%s\0" "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu ## newlines and white spaces
# A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu )) # throws warning and breaks at space but not newline
# printf '%s\n' "${Array1[@]}" "${Array2[@]}" | sort | uniq -u ## manage spaces
# A_Destination_orphans_diff=($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )) # breaks at space and newline
# A_Destination_orphans_diff="($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u ))" # creates string surrounded by ()
# A_Destination_orphans_diff=("$( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )") # creates string
# A_Destination_orphans_diff=($( printf '%s\n' ${A_Destination_dubUnders[@]} ${A_Destination_orphans[@]} | sort | uniq -u )) # drops file path after space
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans_diff[${i}]}" # path to each track
done
printf '%b\n' ""
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
echo # rm "${A_Destination_orphans_diff[i]}"
done
func_EnterToContinue
}

This throws a warning and breaks at spaces because the array is built by direct assignment from an unquoted command substitution. The output undergoes word splitting, so an entry containing spaces is split into several entries, and the warning appears because command substitution discards the NUL bytes.
A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu ))
To avoid the issue with the method above, you can use mapfile/readarray to read a stream of null-delimited entries.
mapfile -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
If your Bash is too old to support mapfile -d (the -d option was added in Bash 4.4), you can perform the same task with IFS=$'\37' read -r -d '' -a array.
$'\37' is the shell's C-style string syntax for octal code 37, which is ASCII 31, US (Unit Separator):
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
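To see the mapfile approach on data with spaces in the paths, here is a small self-contained sketch. The arrays keep and found and their paths are hypothetical stand-ins for A_Destination_dubUnders and A_Destination_orphans, and it assumes Bash 4.4 or later plus GNU sort and uniq (for -z):

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for A_Destination_dubUnders and A_Destination_orphans.
keep=(  "/music/a b.mp3" "/music/c.mp3" )
found=( "/music/a b.mp3" "/music/c.mp3" "/music/old one.mp3" )

# Emit every path NUL-terminated, sort, and keep only entries that occur once.
# mapfile -d '' splits on NUL, so spaces inside a path are preserved.
mapfile -t -d '' orphans < <(
    printf '%s\0' "${keep[@]}" "${found[@]}" | sort -z | uniq -zu
)

declare -p orphans
```

Here orphans ends up with the single element "/music/old one.mp3". Note that uniq -u keeps anything that occurs exactly once, so a path that is only in keep would also be listed.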

To remove all files not present in A_Destination_dubUnders array you could:
func_removeDestinationOrphans() {
find "${directory_PMPRoot_destination}" -type f -print0 |
sort -z |
join -z -v1 -t '' - <(printf "%s\0" "${A_Destination_dubUnders[@]}" | sort -z) |
xargs -0 echo rm
}
Use join or comm to find elements that are present in one list but not in another. I often get -v1 wrong, so try -v2 if it echoes the elements from the wrong list (it is not clear from the question whether you want to remove the files present in the A_Destination_dubUnders list or those not present in it).
Note that function name() is a mix of the ksh and POSIX function-definition syntaxes. Just use name() {. See the Bash Hackers wiki page on obsolete syntax.
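The comm variant mentioned above could look like the following sketch. The arrays keep and found are hypothetical stand-ins for A_Destination_dubUnders and the find output, and it assumes GNU comm and sort (for -z) and Bash 4.4 or later:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: keep plays the role of A_Destination_dubUnders,
# found plays the role of the paths printed by find.
keep=(  "/music/a b.mp3" "/music/c.mp3" "/music/not on disk.mp3" )
found=( "/music/a b.mp3" "/music/c.mp3" "/music/old one.mp3" )

# comm -23 suppresses columns 2 and 3, leaving only lines unique to the
# first input, i.e. files that were found but are not in the keep list.
mapfile -t -d '' orphans < <(
    comm -z -23 \
        <(printf '%s\0' "${found[@]}" | sort -z) \
        <(printf '%s\0' "${keep[@]}"  | sort -z)
)

declare -p orphans
```

Unlike sort | uniq -u, this is directional: an entry that exists only in keep ("/music/not on disk.mp3" here) is not reported, so only "/music/old one.mp3" is selected for removal.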

Here is the working version with modifications thanks to suggested input from the first two respondents (thanks!).
function func_removeDestinationOrphans() {
printf '%s\n' " → Purge playlist orphans: " ""
printf '%b\n' "First we will remove any files not present in your proposed playlist. "
func_EnterToContinue
bash_version="$( bash --version | head -n1 | cut -d " " -f4 | cut -d "(" -f1 )"
if printf '%s\n' "4.4.0" "${bash_version}" | sort -V -C ; then
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 ) # readarray or mapfile -d fails before bash 4.4.0
readarray -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
else
while IFS= read -r -d $'\0'; do
A_Destination_orphans+=( "$REPLY" )
done < <( find "${directory_PMPRoot_destination}" -type f -print0 )
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
fi
if [[ ! "${A_Destination_orphans_diff[*]}" = '' ]] ; then
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
rm "${A_Destination_orphans_diff[i]}"
done
fi
}
If you would like to see the entire Personal Music Player sync script, you can find that via my GitHub.

How to speed up bash random name generation?

I have a problem with my code's performance: it runs very slowly. I need to generate a million or more random persons and insert them into my Postgres database. Each person has the parameters name, birthdate, gender, and age. I created lists of first names and last names from which I randomly select a name. Can someone help me?
Here is my code:
#docker params
name="`docker ps | rev | cut -d " " -f1 | rev | grep -v NAMES`"
dbs_name="DBS_projekt"
#load names from files
firstName=(`cat generatorSource/firstNames.txt`)
firstNameCount="`wc -l generatorSource/firstNames.txt | tr -s ' ' | cut -d ' ' -f2`"
secondName=(`cat generatorSource/lastNames.txt`)
secondNameCount="`wc -l generatorSource/lastNames.txt| tr -s ' ' | cut -d ' ' -f2`"
#gender array
gender=("Male" "Female" "Other")
#actual date
now=$(date | rev | cut -d " " -f1 | rev)
array=()
for ((x = 1; x <= 1000;x++))
do
array+="INSERT INTO persons(name,birthdate,gender,age) VALUES"
for (( n=1; n<=1000; n++ ))
do
secondrand=$(( ( RANDOM % $secondNameCount ) ))
firstrand=$(( ( RANDOM % $firstNameCount ) ))
genderand=$(( ( RANDOM % 3 ) ))
year=$(( ( RANDOM % 118 ) + 1900 ))
month=$(((RANDOM % 12) + 1))
day=$(((RANDOM % 28) + 1))
age=$(expr $now - $year)
if [ $n -eq 1000 ]; then
array+="('${firstName[$firstrand]}
${secondName[$secondrand]}','$year-$month-$day',
'${gender[$genderand]}','$age');"
else
array+="('${firstName[$firstrand]}
${secondName[$secondrand]}','$year-$month-$day',
'${gender[$genderand]}','$age'),"
fi
done
done
#run psql in docker and run insert commands
docker exec -i $name psql -U postgres << EOF
\c $dbs_name
$array
EOF

Note that you declare "array" as an array, but you use it as a string.
array=()
...
array+="INSERT INTO persons(name,birthdate,gender,age) VALUES"
This is what's happening:
$ array=()
$ declare -p array
declare -a array='()'
$ array+="first"
$ array+="second"
$ declare -p array
declare -a array='([0]="firstsecond")'
To insert an element into an array, you must use parentheses:
$ array=()
$ array+=("first")
$ array+=("second")
$ declare -p array
declare -a array='([0]="first" [1]="second")'
I suspect this may be one source of slowness: you're constructing one gigantic string. Add the parentheses as shown, and then change the docker call to
IFS=$'\n'
docker exec -i $name psql -U postgres << EOF
\c $dbs_name
${array[*]}
EOF
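Putting the array fix together, a minimal sketch of building one multi-row INSERT might look like this. The short name lists and the loop count of 5 are hypothetical stand-ins for the question's files and batch size. It also swaps the external expr call for shell arithmetic, on the assumption (not measured here) that forking a process per row contributes to the slowness:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the lists loaded from files in the question.
firstName=(Ann Bob Eve)
secondName=(Smith Jones)
gender=(Male Female Other)
now=$(date +%Y)

rows=()
for (( n = 0; n < 5; n++ )); do
    year=$(( RANDOM % 118 + 1900 ))
    month=$(( RANDOM % 12 + 1 ))
    day=$(( RANDOM % 28 + 1 ))
    age=$(( now - year ))     # shell arithmetic instead of forking expr
    # Append one array element per row; note the parentheses after +=.
    rows+=("('${firstName[RANDOM % ${#firstName[@]}]} ${secondName[RANDOM % ${#secondName[@]}]}','$year-$month-$day','${gender[RANDOM % 3]}','$age')")
done

# Join the rows with commas inside a subshell so IFS is not changed globally.
sql="INSERT INTO persons(name,birthdate,gender,age) VALUES $(IFS=,; echo "${rows[*]}");"
echo "$sql"
```

Joining with IFS=, and "${rows[*]}" removes the need for the n -eq 1000 special case that decides between a trailing comma and a semicolon.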

Splitting files into multiple files based on some pattern and take some information

I'm working with a lot of files with this structure:
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
This structure is repeated thousands of times but with different data inside. As you can see, between some begin-end, sometimes SEQ and USER03 are not there. This is because the protein is not identified ... And here comes my problem.
I would like to know how many proteins are identified and how many are unidentified. To do this I was trying this:
for i in $(ls *.txt ); do
echo $i
awk '/^BEGIN/{n++;w=1} n&&w{print > "./cache/out" n ".txt"} /^END/{w=0}' $i
done
I found this here (Split a file into multiple files based on a pattern and name the new files by the search pattern in Unix?)
And then use the outputs and classify them:
for i in $(ls cache/*.txt ); do
echo $i
if grep -q 'SEQ' $i; then
mv $i ./archive_identified
else
mv $i ./archive_unidentified
fi
done
After this, I'd like to take some data (Example: spectrum, USER03, SEQ, TAXONOMY) from classified files.
for i in $( ls archive_identified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_ide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_ide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_ide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum.txt
done
for i in $( ls archive_unidentified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_unide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_unide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_unide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum_unide.txt
done
The problem is that the first part of the script takes too much time because the data is large (12-15 GB). Is there an easier way to do this?
Thank you in advance.

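You can do it all in one awk script. awk iterates over all records itself, so you don't need an external loop. Setting RS to the empty string puts awk in paragraph mode, so this assumes the blocks are separated by blank lines. For example, for the data file you provided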
$ awk -v RS= '/\nSEQ/ {seq++; print > "file_path_with_seq" NR ".txt"; next}
{noseq++; print > "file_path_without_seq" NR ".txt"}
END { print "with seq:", seq;
print "without seq:", noseq}' file
will print
with seq: 2
without seq: 1
and produces the files
$ head file_path_with*
==> file_path_with_seq1.txt <==
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
==> file_path_with_seq2.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
==> file_path_without_seq3.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
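If only the counts are needed, or the blocks are not separated by blank lines, a variant keyed on the BEGIN and END lines avoids writing thousands of small files. This is a sketch under the assumption that every block starts with a line beginning with BEGIN and ends with a line beginning with END, as in the sample; the temporary file and the variable names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical two-block sample in the question's format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
TAXONOMY=9606
1564.717 92354.000
END
EOF

# Track whether the current block contained a SEQ= line, and count the
# block as identified or unidentified when its END line is reached.
summary=$(awk '
    /^BEGIN/ { has_seq = 0 }
    /^SEQ=/  { has_seq = 1 }
    /^END/   { if (has_seq) ident++; else unident++ }
    END      { printf "identified=%d unidentified=%d\n", ident, unident }
' "$sample")
rm -f "$sample"

echo "$summary"
```

For this sample it prints identified=1 unidentified=1. The same pattern rules could print the SEQ, TAXONOMY, or TITLE values to separate output files in the same single pass.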

Shell script: Sed substitution throwing unknown command: ` '

I'm having some trouble getting around an issue with a sed substitution that takes array elements as variables. I'm nearly certain it has something to do with the way I'm passing the variables, but I've been searching for hours for a solution to no avail.
The arrays are initialized in the first six lines of code before attempting a sed substitution. The full script is below:
#!/bin/bash
scalarLineNums=( $(grep -nr 'MCSTEP\|NCSTEP\|ISAVE\|DCGRAX\|DCGRAY\|DCGRAZ\|DCSTEC\|DCTIME\|ICOUTF\|ICOUTI\|D1PEKS\|D1PEFR\|D1PEPF\|MBCON\|NBCON\|D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ' Layer_Rough.Y3D | cut -d ":" -f 1) )
vectorLineNums=( $(sed -n -e '/D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ/{n;=;p;}' Layer_Rough.Y3D | sed -n 1~2p) )
scalarValuesOriginal=( $(grep -nr 'MCSTEP\|NCSTEP\|ISAVE\|DCGRAX\|DCGRAY\|DCGRAZ\|DCSTEC\|DCTIME\|ICOUTF\|ICOUTI\|D1PEKS\|D1PEFR\|D1PEPF\|MBCON\|NBCON\|D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ' Layer_Rough.Y3D | awk -F ' +' '{print $2}') )
vectorValuesOriginal=( $(sed -n -e '/D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ/{n;=;p;}' Layer_Rough.Y3D | sed -n 2~2p) )
scalarValuesNew=( $(grep -nr 'MCSTEP\|NCSTEP\|ISAVE\|DCGRAX\|DCGRAY\|DCGRAZ\|DCSTEC\|DCTIME\|ICOUTF\|ICOUTI\|D1PEKS\|D1PEFR\|D1PEPF\|MBCON\|NBCON\|D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ' new-variable-list.txt | awk -F ' +' '{print $2}') )
vectorValuesNew=( $(sed -n -e '/D1BNVX\|D1BNVY\|D1BNVZ\|I1BNVX\|I1BNVY\|I1BNVZ\|D1BNFX\|D1BNFY\|D1BNFZ\|D1BNAX\|D1BNAY\|D1BNAZ/{n;=;p;}' new-variable-list.txt | sed -n 2~2p) )
i=0
for linenumber in "${scalarLineNums[@]}"
do
sed -i "${linenumber}s/${scalarValuesOriginal[$i]}/${scalarValuesNew[$i]}/" Layer_Rough.Y3D
i=$((i+1))
done
The error I receive when trying to run the script is
sed: -e expression #1, char 2: unknown command: `
'
The for loop is an attempt to perform a substitution on a per line basis, i.e., sed 'line#s/oldvalue/newvalue/'. A few of the elements contain '+' and '-' characters as some of the values are stored in scientific notation, but do not contain any slashes or whitespace.

shell script array won't populate from for loop

Can anyone tell me why this array creation: cccr[$string_1]=$string_2 #doesn't work?
#!/bin/bash
firstline='[Event "Marchand Open"][Site "Rochester NY"][Date "2005.03.19"][Round "1"][White "Smith, Igor"][Black "Jones, Matt"][Result "1-0"][ECO "C01"][WhiteElo "2409"][BlackElo "1911"]'
unset cccr
declare -A cccr
(IFS='['; for word in $firstline; do
string_1=$(echo $word | cut -f1 -d'"' | tr -d ' ')
string_2=$( echo $word | cut -f2 -d'"' )
if [ ! -z $string_1 ]; then # If $string_1 is not empty
cccr[$string_1]=$string_2 # why doesn't this line work?
fi
done)
echo ${cccr[Event]} # echos null string

It happens because the value of string_1 is empty at the first iteration.
Example :
#!/bin/bash
firstline='[Event "Marchand Open"][Site "Rochester NY"][Date "2005.03.19"][Round "1"][White "Smith, Igor"][Black "Jones, Matt"][Result "1-0"][ECO "C01"][WhiteElo "2409"][BlackElo "1911"]'
unset cccr
declare -A cccr
(IFS='['; for word in $firstline; do
string_1=$( echo $word | cut -f1 -d'"' )
string_2=$( echo $word | cut -f2 -d'"' )
echo "$string_1 - $string_2"
#cccr[$string_1]=$string_2
done)
Output :
- # Problem !
Event - Marchand Open
Site - Rochester NY
...
You have to modify your script so that an empty value of string_1 is never used as a key.
A very simple workaround is to check the value of string_1 before using it.
Example :
# ...
string_1=$( echo $word | cut -f1 -d'"' )
string_2=$( echo $word | cut -f2 -d'"' )
if [ ! -z $string_1 ]; then # If $string_1 is not empty
echo "$string_1 - $string_2"
cccr[$string_1]=$string_2
fi
# ...
From the man page of [
-z STRING
the length of STRING is zero
Output :
Event - Marchand Open
Site - Rochester NY
# ... No problem
EDIT
BTW, if you look at the value of string_1, you will see that it is 'Event ' and not 'Event' (there is a trailing space after Event).
So cccr[Event] does not exist, but cccr[Event ] exists.
To fix that, you can delete the whitespaces in string_1 :
string_1=$(echo $word | cut -f1 -d'"' | tr -d ' ') # tr -d ' ' deletes all the whitespaces
EDIT 2
I forgot to mention that it is expected for this not to work: the loop is executed in a subshell environment, so the array is filled in the subshell but not in the current shell.
From the man page of bash :
(list) list is executed in a subshell environment (see COMMAND EXECUTION ENVIRONMENT below). Variable
assignments and builtin commands that affect the shell's environment do not remain in effect
after the command completes. The return status is the exit status of list.
So there are 2 solutions :
1. Don't run the loop in a subshell (remove the parentheses).
# ...
OLDIFS=$IFS
IFS='['
for word in $firstline; do
string_1=$(echo $word | cut -f1 -d'"' | tr -d ' ')
string_2=$(echo $word | cut -f2 -d'"')
if [ ! -z $string_1 ]; then
cccr[$string_1]=$string_2
fi
done
IFS=$OLDIFS
echo "Event = ${cccr[Event]}"
echo "Site = ${cccr[Site]}"
Output :
Event = Marchand Open
Site = Rochester NY
2. Use your array in the subshell.
# ...
(IFS='['
for word in $firstline; do
string_1=$(echo $word | cut -f1 -d'"' | tr -d ' ')
string_2=$(echo $word | cut -f2 -d'"')
if [ ! -z $string_1 ]; then # If $string_1 is not empty
cccr[$string_1]=$string_2
fi
done
echo "Event = ${cccr[Event]}"
echo "Site = ${cccr[Site]}"
)
Output :
Event = Marchand Open
Site = Rochester NY
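As an alternative that needs neither a subshell nor a changed IFS, the pairs can be peeled off with Bash's regex matching. This is only a sketch, assuming every entry has the form [Key "Value"] with a key that contains no spaces and a value that contains no double quotes; the shortened firstline and the names rest and re are illustrative:

```shell
#!/usr/bin/env bash
firstline='[Event "Marchand Open"][Site "Rochester NY"][Result "1-0"]'

declare -A cccr
rest=$firstline
# Matches one leading [Key "Value"] pair and captures whatever follows it.
re='^\[([^ ]+) "([^"]*)"\](.*)$'

# Each iteration stores one pair and continues with the remainder.
while [[ $rest =~ $re ]]; do
    cccr[${BASH_REMATCH[1]}]=${BASH_REMATCH[2]}
    rest=${BASH_REMATCH[3]}
done

echo "Event = ${cccr[Event]}"
echo "Site = ${cccr[Site]}"
```

Because the loop runs in the current shell, cccr is still populated afterwards, and no cut or tr processes are started per entry.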
