How to speed up bash random name generation? - database

I have a problem with my code's performance: it runs very slowly. I need to generate a million or more random persons and insert them into my Postgres database. Each person has the parameters name, birthdate, gender, and age. I created lists of first names and last names from which I randomly select a name. Can someone help me?
Here is my code:
#docker params
name="`docker ps | rev | cut -d " " -f1 | rev | grep -v NAMES`"
dbs_name="DBS_projekt"
#load names from files
firstName=(`cat generatorSource/firstNames.txt`)
firstNameCount="`wc -l generatorSource/firstNames.txt | tr -s ' ' | cut -d ' ' -f2`"
secondName=(`cat generatorSource/lastNames.txt`)
secondNameCount="`wc -l generatorSource/lastNames.txt| tr -s ' ' | cut -d ' ' -f2`"
#gender array
gender=("Male" "Female" "Other")
#actual date
now=$(date | rev | cut -d " " -f1 | rev)
array=()
for ((x = 1; x <= 1000;x++))
do
array+="INSERT INTO persons(name,birthdate,gender,age) VALUES"
for (( n=1; n<=1000; n++ ))
do
secondrand=$(( ( RANDOM % $secondNameCount ) ))
firstrand=$(( ( RANDOM % $firstNameCount ) ))
genderand=$(( ( RANDOM % 3 ) ))
year=$(( ( RANDOM % 118 ) + 1900 ))
month=$(((RANDOM % 12) + 1))
day=$(((RANDOM % 28) + 1))
age=$(expr $now - $year)
if [ $n -eq 1000 ]; then
array+="('${firstName[$firstrand]}
${secondName[$secondrand]}','$year-$month-$day',
'${gender[$genderand]}','$age');"
else
array+="('${firstName[$firstrand]}
${secondName[$secondrand]}','$year-$month-$day',
'${gender[$genderand]}','$age'),"
fi
done
done
#run psql in docker and run insert commands
docker exec -i $name psql -U postgres << EOF
\c $dbs_name
$array
EOF

Note that you declare "array" as an array, but you use it as a string.
array=()
...
array+="INSERT INTO persons(name,birthdate,gender,age) VALUES"
This is what's happening:
$ array=()
$ declare -p array
declare -a array='()'
$ array+="first"
$ array+="second"
$ declare -p array
declare -a array='([0]="firstsecond")'
To insert an element into an array, you must use parentheses:
$ array=()
$ array+=("first")
$ array+=("second")
$ declare -p array
declare -a array='([0]="first" [1]="second")'
I suspect this may be one source of slowness: you're constructing one gigantic string. Add the parentheses as shown, and then change the docker call to
IFS=$'\n'
docker exec -i $name psql -U postgres << EOF
\c $dbs_name
${array[*]}
EOF
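A minimal sketch of the inner loop with the array fix applied, assuming bash. The function name build_insert and the short name lists are hypothetical stand-ins for the files in generatorSource/; the table and column names are taken from the question. It also replaces the expr and date | rev | cut calls with shell arithmetic and date +%Y, since every external process started inside a loop adds overhead.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for generatorSource/firstNames.txt and lastNames.txt.
firstName=(Anna Boris Clara)
secondName=(Novak Kovac Horvat)
gender=(Male Female Other)
now=$(date +%Y)   # current year

# Print one multi-row INSERT statement containing $1 rows.
build_insert() {
    local rows=$1 n year month day
    local values=()                       # one array element per row
    for (( n = 0; n < rows; n++ )); do
        year=$(( RANDOM % 118 + 1900 ))
        month=$(( RANDOM % 12 + 1 ))
        day=$(( RANDOM % 28 + 1 ))
        # += with parentheses appends an element instead of growing one string
        values+=("('${firstName[RANDOM % ${#firstName[@]}]} ${secondName[RANDOM % ${#secondName[@]}]}','$year-$month-$day','${gender[RANDOM % 3]}','$(( now - year ))')")
    done
    local IFS=,                           # "${values[*]}" joins elements with a comma
    printf 'INSERT INTO persons(name,birthdate,gender,age) VALUES\n%s;\n' "${values[*]}"
}

build_insert 3
```

Calling build_insert once per batch and piping its output to the psql call from the question would avoid holding every statement in memory at once.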

Related

Convert field names to lower case using miller

I would like to use miller (mlr) to convert column names to lower case. The closest I get is using the rename verb with a regular expression. \L should change the case, but instead the column names are getting prefixed by "\L".
I'm using macOS Catalina and miller 5.10.0
echo -e 'A,B,C\n1,2,3' | mlr --csv --opprint rename -r '(.*),\L\1'
prints
\LA \LB \LC
1 2 3
But I would like it to print
a b c
1 2 3
Two example ways:
echo -e 'A,B,C\n1,2,3' | mlr --csv put '
map inrec = $*;
$* = {};
for (oldkey, value in inrec) {
newkey = tolower(oldkey);
$[newkey] = value;
}
'
or
echo -e 'A,B,C\n1,2,3' | mlr --csv -N put -S 'if (NR == 1) {for (k in $*) {$[k] = tolower($[k])}}'
Sometimes, standard tools are easier to use:
echo -e 'A,B,C\n1,2,3' | awk 'NR == 1 {print tolower($0); next} 1'
UPDATE
with Miller:
echo -e 'A,B,C\n1,2,3' |
mlr --csv -N put 'NR == 1 {for (k,v in $*) {$[k] = tolower(v)}}'

diff two arrays each containing files paths into a third array (for removal)

In the function below you will see notes on several attempts to solve this problem; each attempt has a note indicating what went wrong. Between my attempts there is a line from another question here which purports to solve some element of the matter. Again, I've added a note indicating what that is supposed to solve. My brain is mush at this point. What is the stupid simple thing I'm overlooking?
function func_removeDestinationOrphans() {
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 )
for (( i = 0 ; i < ${#A_Destination_orphans[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans[${i}]}" # path to each track
done
printf '%b\n' ""
# https://stackoverflow.com/questions/2312762/compare-difference-of-two-arrays-in-bash
# echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u ## original
# Array3=(`echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u `) ## store in array
# A_Destination_orphans_diff=(`echo "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | tr ' ' '\n' | sort | uniq -u `) # drops file path after space
# printf "%s\0" "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu ## newlines and white spaces
# A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu )) # throws warning and breaks at space but not newline
# printf '%s\n' "${Array1[@]}" "${Array2[@]}" | sort | uniq -u ## manage spaces
# A_Destination_orphans_diff=($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )) # breaks at space and newline
# A_Destination_orphans_diff="($( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u ))" # creates string surrounded by ()
# A_Destination_orphans_diff=("$( printf '%s\n' "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort | uniq -u )") # creates string
# A_Destination_orphans_diff=($( printf '%s\n' ${A_Destination_dubUnders[@]} ${A_Destination_orphans[@]} | sort | uniq -u )) # drops file path after space
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
printf '%s\n' "→ ${A_Destination_orphans_diff[${i}]}" # path to each track
done
printf '%b\n' ""
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
echo # rm "${A_Destination_orphans_diff[i]}"
done
func_EnterToContinue
}
This throws a warning and breaks at spaces because the array is built by direct assignment from an unquoted command substitution. Bash drops the null bytes (hence the warning) and word-splits the result, so an entry containing a space is split into separate entries.
A_Destination_orphans_diff=($( printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" | sort -z | uniq -zu ))
To avoid the issue with the method above, you can read a stream of null-delimited entries with mapfile/readarray.
mapfile -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
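A small self-contained sketch of the mapfile approach above, assuming bash 4.4 or newer and GNU sort/uniq (for -z). The array names keep, found and orphans and the sample paths are hypothetical. Listing keep twice is a variation: uniq -u on its own yields a symmetric difference, whereas duplicating one list guarantees that none of its entries can appear in the result.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; one path contains a space, one a newline.
keep=("music/a song.mp3" "music/b.mp3" "music/not on disk.mp3")
found=("music/a song.mp3" "music/b.mp3" "music/old track.mp3" $'music/odd\nname.mp3')

# Every entry of keep occurs at least twice, so uniq -u can only emit
# entries that appear in found and nowhere else.
mapfile -t -d '' orphans < <(
    printf '%s\0' "${keep[@]}" "${keep[@]}" "${found[@]}" |
    sort -z |
    uniq -zu
)

printf '%d orphan(s)\n' "${#orphans[@]}"
printf '<%s>\n' "${orphans[@]}"
```

Each entry stays intact because the null byte is the only delimiter, so neither spaces nor newlines inside a path cause a split.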
If your bash is too old to support mapfile -d (it requires bash 4.4), you can perform the same task with IFS=$'\37' read -r -d '' -a array.
$'\37' is the shell's C-style string syntax for octal code 37, which is ASCII 31, US (Unit Separator):
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
To remove all files not present in A_Destination_dubUnders array you could:
func_removeDestinationOrphans() {
find "${directory_PMPRoot_destination}" -type f -print0 |
sort -z |
join -z -v1 -t '' - <(printf "%s\0" "${A_Destination_dubUnders[@]}" | sort -z) |
xargs -0 echo rm
}
Use join or comm to find elements not present in one list and present in another list. I am usually wrong about -v1, so try with -v2 if it echoes the elements from wrong list (I do not understand if you want to remove files present in A_Destination_dubUnders list or not present, you did not specify that).
Note that function name() mixes the ksh and POSIX function definition styles; just use name() {. See the Bash Hackers Wiki page on obsolete syntax.
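For the comm alternative mentioned above, a hedged sketch, assuming GNU coreutils (comm -z) and bash 4.4 or newer. The arrays found and keep and the sample paths are hypothetical; found stands in for the find output and keep for A_Destination_dubUnders.

```shell
#!/usr/bin/env bash
# Hypothetical sample data.
found=("music/a song.mp3" "music/b.mp3" "music/old track.mp3")
keep=("music/b.mp3" "music/a song.mp3")

# comm -23 suppresses columns 2 and 3, leaving entries unique to the first
# input. Both inputs must be sorted the same way comm compares them, hence
# LC_ALL=C on every command.
mapfile -t -d '' only_found < <(
    LC_ALL=C comm -z -23 \
        <(printf '%s\0' "${found[@]}" | LC_ALL=C sort -z) \
        <(printf '%s\0' "${keep[@]}"  | LC_ALL=C sort -z)
)

printf 'would remove: %s\n' "${only_found[@]}"
```

Unlike -v1 versus -v2 with join, the direction here is decided by argument order: swapping the two inputs lists the entries found only in keep.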
Here is the working version with modifications thanks to suggested input from the first two respondents (thanks!).
function func_removeDestinationOrphans() {
printf '%s\n' " → Purge playlist orphans: " ""
printf '%b\n' "First we will remove any files not present in your proposed playlist. "
func_EnterToContinue
bash_version="$( bash --version | head -n1 | cut -d " " -f4 | cut -d "(" -f1 )"
if printf '%s\n' "4.4.0" "${bash_version}" | sort -V -C ; then
readarray -d '' A_Destination_orphans < <( find "${directory_PMPRoot_destination}" -type f -print0 ) # readarray or mapfile -d fails before bash 4.4.0
readarray -t -d '' A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu
)
else
while IFS= read -r -d $'\0'; do
A_Destination_orphans+=( "$REPLY" )
done < <( find "${directory_PMPRoot_destination}" -type f -print0 )
IFS=$'\37' read -r -d '' -a A_Destination_orphans_diff < <(
printf "%s\0" "${A_Destination_dubUnders[@]}" "${A_Destination_dubUnders[@]}" "${A_Destination_orphans[@]}" |
sort -z |
uniq -zu |
xargs -0 printf '%s\37'
)
fi
if [[ ! "${A_Destination_orphans_diff[*]}" = '' ]] ; then
for (( i = 0 ; i < ${#A_Destination_orphans_diff[@]} ; i++ )) ; do
rm "${A_Destination_orphans_diff[i]}"
done
fi
}
If you would like to see the entire Personal Music Player sync script, you can find that via my GitHub.

Putting multiline awk command in a for loop not printing the variable

I have a command as follows, which takes lines from the Allergens file, based on lines from the IDs file.
awk '
FNR==NR{
a[$1]
next
}
/^Query/ || $2 in a
' IDs C100_Allergens | grep -B 1 'Hit: ' | grep -v '^--' > C100_Allergens_matches.txt
However, I have numerous sample_Allergens files, and want to run it in a loop as such, where list is a file with different sample names:
for i in `cat list`
do
awk '
FNR==NR{
a[$1]
next
}
/^Query/ || $2 in a
' IDs "$i"_Allergens | grep -B 1 'Hit: ' | grep -v '^--' > "$i"_Allergens_matches.txt
done
I tried this loop, including using the variable flag for awk, i.e. -v i="$i":
for i in `cat list`
do
awk -v i="$i" '
FNR==NR{
a[$1]
next
}
/^Query/ || $2 in a
' IDs "$i"_Allergens | grep -B 1 'Hit: ' | grep -v '^--' > "$i"_Allergens_matches.txt
done
I only keep getting empty files. Thanks in advance for your help!

Read delimited multiline string file into multiple arrays in Bash

I began with a file like so:
Table_name1 - Table_desc1
Table_name2 - Table_desc2
...
...
I have a script that parses this file and splits them into two arrays:
declare -a TABLE_IDS=()
declare -a TABLE_DESCS=()
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]=${line%' '-' '*}
TABLE_DESCS[i++]=${line#*' '-' '}
done < "${TABLE_LIST}"
for i in "${!TABLE_IDS[@]}"; do
echo "Creating Table ID: "${TABLE_IDS[i]}", with Table Description: "${TABLE_DESCS[i]}""
done
This works really well, with no problems whatsoever.
I wanted to extend this and make the file:
Table_name1 - Table_desc1 - Table_schema1
Table_name2 - Table_desc2 - Table_schema2
...
...
For this, I tried:
declare -a TABLE_IDS=()
declare -a TABLE_DESCS=()
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]="$(echo $line | cut -f1 -d - | tr -d ' ')"
TABLE_DESCS[i++]="$(echo $line | cut -f2 -d - | tr -d ' ')"
TABLE_SCHEMAS[i++]="$(echo $line | cut -f3 -d - | tr -d ' ')"
done < "${TABLE_LIST}"
for i in "${!TABLE_IDS[@]}"; do
echo "Creating Table ID: "${TABLE_IDS[i]}", with Table Description: "${TABLE_DESCS[i]}" and schema: "${TABLE_SCHEMAS[i]}""
done
And while this will faithfully list all the Table IDs and the Table descriptions, the schemas are omitted. I tried:
while IFS= read -r line || [[ -n "${line}" ]]; do
TABLE_IDS[i]="$(echo $line | cut -f1 -d - | tr -d ' ')"
TABLE_DESCS[i]="$(echo $line | cut -f2 -d - | tr -d ' ')"
TABLE_SCHEMAS[i]="$(echo $line | cut -f3 -d - | tr -d ' ')"
done < "${TABLE_LIST}"
And it returns just the last line's Table name, description AND schema. I suspect this is an indexing/looping problem, but am unable to figure out what exactly is going wrong. Please help! Thanks!
Perhaps set IFS to the actual delimiter (-) and do the processing in the read loop instead of deferring it and using arrays.
$ while IFS=- read -r t d s;
do
echo "Creating Table ID: ${t// }, with Table Description: ${d// } and schema: ${s// }";
done < file
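If the arrays are still needed later, the same read loop can fill them. In the question's first attempt i++ runs twice per line, so each schema lands one index past its table ID; in the second attempt i is never incremented, so every line overwrites index 0. The sketch below sidesteps the counter by appending with +=( ). It assumes bash; the temporary file and its two sample lines are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample input in the three-column format from the question.
table_list=$(mktemp)
printf '%s\n' 'Table_name1 - Table_desc1 - Table_schema1' \
              'Table_name2 - Table_desc2 - Table_schema2' > "$table_list"

TABLE_IDS=() TABLE_DESCS=() TABLE_SCHEMAS=()
# Split each line on "-" and append to all three arrays in the same
# iteration, so the indices stay aligned without a manual counter.
while IFS=- read -r t d s || [[ -n $t ]]; do
    TABLE_IDS+=("${t// }")        # ${var// } removes every space
    TABLE_DESCS+=("${d// }")
    TABLE_SCHEMAS+=("${s// }")
done < "$table_list"
rm -f "$table_list"

for i in "${!TABLE_IDS[@]}"; do
    echo "Creating Table ID: ${TABLE_IDS[i]}, with Table Description: ${TABLE_DESCS[i]} and schema: ${TABLE_SCHEMAS[i]}"
done
```

Two caveats: a hyphen inside a name or description would be treated as a delimiter, and ${var// } also strips spaces inside a value, not only around it.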

Splitting files into multiple files based on some pattern and take some information

I'm working with a lot of files with this structure:
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
This structure is repeated thousands of times but with different data inside. As you can see, between some begin-end, sometimes SEQ and USER03 are not there. This is because the protein is not identified ... And here comes my problem.
I would like to know how many proteins are identified and how many are unidentified. To do this I was trying this:
for i in $(ls *.txt ); do
echo $i
awk '/^BEGIN/{n++;w=1} n&&w{print > "./cache/out" n ".txt"} /^END/{w=0}' $i
done
I found this here (Split a file into multiple files based on a pattern and name the new files by the search pattern in Unix?)
And then use the outputs and classify them:
for i in $(ls cache/*.txt ); do
echo $i
if grep -q 'SEQ' $i; then
mv $i ./archive_identified
else
mv $i ./archive_unidentified
fi
done
After this, I'd like to take some data (Example: spectrum, USER03, SEQ, TAXONOMY) from classified files.
for i in $( ls archive_identified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_ide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_ide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_ide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum.txt
done
for i in $( ls archive_unidentified/*.txt ); do
echo $i
grep 'SEQ' $i | cut -d "=" -f2- | tr ',' '\n' >> ./sequences_unide.txt
grep 'TAXONOMY' $i | cut -d "=" -f2- | tr ',' '\n' >> ./taxonomy_unide.txt
grep 'USER' $i | cut -d "=" -f2- >> ./modifications_unide.txt
grep 'TITLE' $i | sed 's/^.*\(spectrum.*\)/\1/g' | cut -d "=" -f2- >> ./spectrum_unide.txt
done
The problem is that the first part of the script takes too much time due to the large size of the data (12-15gb.). Is there any way to do this easier?
Thank you in advance.
You can do it all in one awk script. awk iterates over all records itself, so you don't need an external loop. Setting RS to the empty string makes awk treat each blank-line-separated block as one record. For example, for the data file you provided
$ awk -v RS= '/\nSEQ/ {seq++; print > "file_path_with_seq" NR ".txt"; next}
{noseq++; print > "file_path_without_seq" NR ".txt"}
END { print "with seq:", seq;
print "without seq:", noseq}' file
will print
with seq: 2
without seq: 1
and produces the files
$ head file_path_with*
==> file_path_with_seq1.txt <==
BEGIN
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1393
PEPMASS=946.3980102539062
CHARGE=3.0+
USER03=
SEQ=DDDIAAL
TAXONOMY=9606
272.228 126847.000
273.252 33795.000
END
==> file_path_with_seq2.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=1383
PEPMASS=911.3920288085938
CHARGE=2.0+
USER03=
SEQ=QGKFEAAETLEEAAMR
TAXONOMY=9606
1394.637 71404.000
1411.668 122728.000
END
==> file_path_without_seq3.txt <==
BEGIN IONS
TITLE=id=PRD000012;PRIDE_Exp_Complete_Ac_1645.xml;spectrum=2965
PEPMASS=946.3900146484375
CHARGE=3.0+
TAXONOMY=9606
1564.717 92354.000
1677.738 33865.000
END
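If the blocks are not separated by blank lines, paragraph mode (RS=) will not split them. A sketch that tracks the BEGIN and END lines instead, and also pulls out the fields mentioned in the question in the same pass, without writing one file per spectrum. The function name classify_spectra and the tab-separated output layout are assumptions, not part of the original answer.

```shell
# classify_spectra FILE
# Prints one tab-separated line per BEGIN..END block:
#   status, spectrum number, SEQ value, TAXONOMY value
# followed by a summary line with both counts.
classify_spectra() {
    awk -F= -v OFS='\t' '
        /^BEGIN/                { inblock = 1; spectrum = seq = tax = ""; next }
        inblock && /^TITLE=/    { spectrum = $NF }   # text after the last "="
        inblock && /^SEQ=/      { seq = $2 }
        inblock && /^TAXONOMY=/ { tax = $2 }
        inblock && /^END/ {
            status = (seq != "") ? "identified" : "unidentified"
            count[status]++
            print status, spectrum, seq, tax
            inblock = 0
        }
        END {
            printf "identified: %d, unidentified: %d\n",
                   count["identified"], count["unidentified"]
        }
    ' "$1"
}
```

Usage would be classify_spectra file.txt > summary.tsv. Because each input file is read once and only one output line is written per block, this avoids creating a separate small file for every spectrum.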
