SGE array jobs and R - arrays

I currently have an R script written to perform a population genetic simulation and then write a table of results to a text file. I would like to run multiple instances of this script in parallel using an array job (my university's cluster uses SGE), so that when it's all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).
I spent the better part of the afternoon reading and trying to figure out how to do this, but haven't really found anything along the lines of what I am trying to do. I was wondering if someone could provide an example, or perhaps point me in the direction of something I could read to help with this.

To boil down mithrado's answer to the bare essentials:
Create a job script, pop_gen.bash, that (optionally) passes the SGE task ID to your R script and stores the results in a file identified by that same task ID:
#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
Submit this script as a job array, e.g. 1000 jobs:
qsub -t 1-1000 pop_gen.bash
Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.
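You can smoke-test the mechanism locally before submitting by setting the variable yourself (a hedged example; it assumes Rscript is on your PATH and pop_gen.R is in the current directory):
SGE_TASK_ID=1 bash pop_gen.bash   # should produce Results_1.txt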
Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R you can use it to name the output file:
args <- commandArgs(trailingOnly = TRUE)  # args[1] holds the SGE task ID
out.file <- paste("Results_", args[1], ".txt", sep = "")
# d <- "some data frame"
write.table(d, file = out.file)
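Once all tasks finish, the per-task files can be merged. A hedged sketch using SGE's -N (job name) and -hold_jid (job dependency) flags; merge.bash is a hypothetical helper script:
qsub -N popgen -t 1-1000 pop_gen.bash
# merge.bash could be as simple as:  cat Results_*.txt > all_results.txt
qsub -hold_jid popgen merge.bash   # starts only after every array task completes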
HTH

I am not used to doing this in R, but I've been using the same approach in Python. Imagine you have a script genetic_simulation.r with 3 parameters:
--gene_id, --khmer_len and --output_file.
You will have one CSV file, genetic_sim_parms.csv, with n rows:
first_gene,10,/result/first_gene.txt
...
nth_gene,6,/result/nth_gene.txt
An important detail is the first line of your genetic_simulation.r: it needs to tell the cluster which executable to use (and the script must be made executable, e.g. chmod +x genetic_simulation.r). You might need to tweak its parameters as well; depending on your setup, it will look like:
#!/path/to/Rscript --vanilla
And finally, you will need an array-job bash script:
#!/bin/bash
#$ -t 1-N < change to the number of rows in genetic_sim_parms.csv
#$ -N genetic_simulation.r
echo "Starting on : $(date)"
echo "Running on node : $(hostname)"
echo "Current directory : $(pwd)"
echo "Current job ID : $JOB_ID"
echo "Current job name : $JOB_NAME"
echo "Task index number : $SGE_TASK_ID"
ID=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $1}' genetic_sim_parms.csv)
LEN=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $2}' genetic_sim_parms.csv)
OUTPUT=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $3}' genetic_sim_parms.csv)
echo "id is: $ID"
./genetic_simulation.r --gene_id "$ID" --khmer_len "$LEN" --output_file "$OUTPUT"
echo "Finished on : $(date)"
Hope this helps!

Related

Perform multiple functions on each object in an array and redirect the results to a file

How to perform multiple functions on each object in an array and output the results to a file?
Here's my array; it's the value of a command that lists all text files in the current directory:
#!/bin/sh
declare -a FILES
FILES=( $(find . -mindepth 1 -maxdepth 1 -type f -iname "*.txt") )
As you can see below, it works:
# Print the full array:
echo "${FILES[#]}"
./1.txt ./2.txt ./3.txt
# Print the number of objects in the array:
echo "${#FILES[#]}"
3
# Loop through each object in the array and print its name:
for file in "${FILES[#]}"
do
echo "$file"
done
./1.txt
./2.txt
./3.txt
For each object in the array I need to perform three functions:
# Use grep to return a whole line:
GREP="$(cat "$file" | grep 'String_A')"
# Use awk to return part of a line only (the second column, after the first colon)
AWK="$(cat "$file" | awk -F: '/String_B/ {print $2; exit}')"
# Use sed to return all lines from C to E (C + D + E)
SED="$(cat "$file" | sed -nE '/String_C/,/String_E/p')"
For this, a "for loop":
for file in "${FILES[#]}"
do
echo "$GREP : $AWK : $SED"
done
File_1_Grep_String_A : File_1_Awk_String_B : File_1_Sed_String_C
File_1_Sed_String_D
File_1_Sed_String_E
File_3_Grep_String_A : File_3_Awk_String_B : File_3_Sed_String_C
File_3_Sed_String_D
File_3_Sed_String_E
File_2_Grep_String_A : File_2_Awk_String_B : File_2_Sed_String_C
File_2_Sed_String_D
File_2_Sed_String_E
This is all that I require, but I need this info written to a file called results.
The problem is, when I redirect this output to a file, only the last object processed in the array ends up there:
for file in "${FILES[#]}"
do
echo "$GREP : $AWK : $SED" > Some_File
done
cat Some_File
File_2_Grep_String_A : File_2_Awk_String_B : File_2_Sed_String_C
File_2_Sed_String_D
File_2_Sed_String_E
Summary:
The command works correctly when it outputs to stdout but not when redirected to a file.
I have found similar questions here on Stack Overflow (performing functions on arrays), but none of them redirect output to a file.
I find this quite weird; I have never had a command print fine to stdout but fail to produce the same text when redirected.
Whilst I'm new here, my last couple of questions weren't received well by some members because I was too brief,
so I think this time I have fully explained and shown the output step by step.
Also, to save you the burden of building an identical test environment and copy/pasting all of this,
I put it up on GitHub for you to clone; hopefully this helps.
Again, to ease things, in there you will find two scripts:
the main one in question, called script.sh (this is the one that needs fixing; it's clean/uncommented),
but I also included one called tests.sh, which is basically all the steps I have taken;
it contains a lot of comments that will further help you understand better than I can explain here.
To be honest, I would definitely have a quick look at that first;
it's also safe to execute (it contains all the working bits I have shown above, like printing the array),
and this saves you typing out commands to test the array etc.
Thank you in advance! I hope I've done better this time.
git clone https://github.com/5c0tt-b0t/stackoverflow_test
Test script output:
########## TESTS: ##########
FULL ARRAY:
./1.txt ./3.txt ./2.txt
COUNT:
3
OBJECTS:
./1.txt
OBJECTS:
./3.txt
OBJECTS:
./2.txt
########## INFO THAT I NEED: ##########
File_1_Grep_String_A : File_1_Awk_String_B : File_1_Sed_String_C
File_1_Sed_String_D
File_1_Sed_String_E
File_3_Grep_String_A : File_3_Awk_String_B : File_3_Sed_String_C
File_3_Sed_String_D
File_3_Sed_String_E
File_2_Grep_String_A : File_2_Awk_String_B : File_2_Sed_String_C
File_2_Sed_String_D
File_2_Sed_String_E
You can redirect the output of the whole loop:
for file in "${FILES[#]}"
do
echo "$GREP : $AWK : $SED"
done > Some_File
Also, use #!/bin/bash when using bashisms like arrays; /bin/sh doesn't necessarily support them.
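Putting both fixes together with the extraction commands from the question (they have to run inside the loop, since they depend on $file), a sketch of the corrected script:
#!/bin/bash
FILES=( $(find . -mindepth 1 -maxdepth 1 -type f -iname "*.txt") )
for file in "${FILES[@]}"
do
    # compute the three pieces for *this* file on each iteration
    GREP="$(grep 'String_A' "$file")"
    AWK="$(awk -F: '/String_B/ {print $2; exit}' "$file")"
    SED="$(sed -nE '/String_C/,/String_E/p' "$file")"
    echo "$GREP : $AWK : $SED"
done > Some_File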

Randomly generating invoice IDs - moving text database into script file?

I've come up with the following bash script to randomly generate invoice numbers, preventing duplicates by logging all generated numbers to a text-file "database".
To my surprise the script actually works, and it seems robust (although I'd be glad to have any flaws pointed out to me at this early stage rather than later on).
What I'm now wondering is whether it's at all possible to move the "database" of generated numbers into the script file itself. This would allow me to rely on and keep track of just the one file rather than two separate ones.
Is this at all possible, and if so, how? If it isn't a good idea, what valid reasons are there not to do so?
#!/usr/bin/env bash
generate_num() {
#num=$(head /dev/urandom | tr -dc '[:digit:]' | cut -c 1-5) [Original method, no longer used]
num=$(shuf -i 10000-99999 -n 1)
}
read -p "Are you sure you want to generate a new invoice ID? [Y/n] " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
generate_num && echo Generating a random invoice ID and checking it against the database...
sleep 2
while grep -xq "$num" "ID_database"
do
echo Invoice ID \#$num already exists in the database...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
while [[ ${#num} -gt 5 ]]
do
echo Invoice ID \#$num is more than 5 digits...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
echo Generated random invoice ID \#$num
sleep 1
echo Invoice ID \#$num does not exist in database...
sleep 2
echo $num >> "ID_database" && echo Successfully added Invoice ID \#$num to the database.
else
echo "Exiting..."
fi
I do not recommend this because:
These things are fragile. One bad edit and your invoice database is corrupt.
It makes version control a pain. Each new version of the script should preferably be checked in. You could add logic to make sure that "$mydir" is an empty directory when you run the script (except for "$myname", .git and other git-related files), then run git -C "$mydir" init if "$mydir"/.git doesn't exist. Then, for each database update, git -C "$mydir" add "$myname" and git -C "$mydir" commit -m "$num". It's just an idea to explore...
Locking - it's possible to do file locking to make sure that no two users run the script at the same time, but it adds to the complexity, so I didn't bother. If you feel that's a risk, you need to add it.
... but you want a self-modifying script, so here goes.
This just adds a new invoice number to its internal database each time you run it. I've explained what goes on in the comments. The last line should read __INVOICES__ (+ a newline) if you copy the script.
As always when dealing with things like this, remember to make a backup before making changes :-)
As it's currently written, you can only add one invoice per run. It shouldn't be hard to move things around (you need a new tempfile) to get it to add more than one if you need that.
#!/bin/bash
set -e # exit on error - important for this type of script
#------------------------------------------------------------------------------
myname="$0"
mydir=$(dirname "$myname")
if [[ ! -w $myname ]]; then
echo "ERROR: You don't have permission to update $myname" >&2
exit 1
fi
# create a tempfile to be able to update the database in the file later
#
# set -e makes the script end if this fails:
temp=$(mktemp -p "$mydir")
trap "{ rm -f "$temp"; }" EXIT # remove the tempfile if we die for some reason
# read current database from the file
readarray -t ID_database <<< "$(sed '0,/^__INVOICES__$/d' "$0")"
#declare -p ID_database >&2 # debug
#------------------------------------------------------------------------------
# a function to check if a number is already in the db
is_it_taken() {
local num=$1
# return 1 (i.e. "yes, it's taken") if the number is already in the db
[[ ! " ${ID_database[@]} " =~ " ${num} " ]]
}
generate_num() {
local num
(exit 1) # set $? to 1
# loop until $? becomes 0
while (( $? )); do
num=$(shuf -i 10000-99999 -n 1)
is_it_taken "$num"
done
# we found a free number
echo $num
}
add_to_db() {
local num=$1
# add to db in memory
ID_database+=($num)
# add to db in file:
# copy the script to the tempfile
cp -pf "$myname" "$temp"
# add the new number
echo $num >> "$temp"
# move the tempfile into place
mv "$temp" "$myname"
}
#------------------------------------------------------------------------------
num=$(generate_num)
add_to_db $num
# your business logic goes here:
echo "All current invoices:"
for invoice in "${ID_database[@]}"
do
echo ">$invoice<"
done
#------------------------------------------------------------------------------
# leave the rest untouched:
exit
__INVOICES__
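A hypothetical session, assuming the script above was saved as invoice.sh (the file name is just for illustration):
chmod +x invoice.sh
./invoice.sh   # appends one new ID below __INVOICES__ and prints the list
./invoice.sh   # each run grows the embedded database by one entry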
Edited
To answer the question you asked -
Make sure your file ends with an explicit exit statement.
Without some sort of branching it won't execute past that, so unless there is a gross parsing error anything below could be used as storage space. Just
echo $num >> $0
If you write your records directly onto the bottom of the script, the script grows, but ...relatively harmlessly. Just make sure your grep pattern doesn't grab any lines of code; a strict pattern like grep -E '^[0-9]+$' seems pretty safe.
This is only ever going to give you a max of ~90k IDs, and it spends unneeded time and cycles on redundancy checking. Is there a limit on the length of the value?
If you can assure there won't be more than one invoice processed per second,
date +%s >> "ID_database" # the UNIX epoch, seconds since 00:00:00 01/01/1970
If you need more precision than that,
date +%Y%m%d%H%M%S%N
will output year, month, day, hour, minute, second, and nanoseconds, which is both immediate and "pretty safe".
date +%s%N # epoch with nanoseconds
is shorter, but doesn't have the convenient side effect of automatically giving you the date and time of invoice creation.
If you absolutely need to guarantee uniqueness and nanoseconds isn't good enough, use a lock of some sort, and maybe a more fine-grained language.
On the other hand, if minutes are unique enough, you could use
date +%y%m%d%H%M
You get the idea.
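For instance, a hedged rewrite of the question's generate_num that swaps the random draw for a timestamp (assuming at most one invoice per second, no database check is needed):
generate_num() {
    num=$(date +%s)   # seconds since the epoch; unique as long as runs are >1 s apart
}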

bash script to collect pids in array

I'm working on a simple bash script that I can ultimately use to tell me if a rogue process is running that we don't want - this one will be running with a different parent PID; a monitor of sorts. Where I'm having an issue is getting all the specific PIDs I want into an array so that I can perform some actions on them. Script first:
#!/bin/bash
rmanRUNNING=`ps -ef|grep /etc/process/process.conf|egrep -v grep|wc -l`
if [ $rmanRUNNING -gt 0 ]
then
rmanPPID=( $(ps -ef|grep processname|egrep -v grep|egrep -v /etc/process/process.conf|awk '{ printf $3 }') )
for i in "${rmanPPID[#]}"
do
:
echo $i
done
fi
So, the goal is to check for the existence of the main process; this is the one running with the config file in it, and the first variable tells me this. Next, if it's running (based on the count being greater than 0), the intention is to populate an array with all the parent PIDs, excluding what would be determined as the main process (we don't need to analyze this one). So, in the array definition we get the list of processes, grep the process name, egrep -v the grep output, also egrep -v the "main" process, and then awk out the parent PIDs, then iterate through and attempt to echo each one individually (more would be done in this section, but it's not working). Unfortunately, when I output $i, all of the parent PIDs are simply concatenated together in one long string. If I try to output a specific array item I get an empty output.
Obviously the question here is, what's wrong with my array definition that is preventing it from being declared as an array, or some other odd thing.
This is on RHEL, 6.2 on the test environment, probably 7 in production by the time this is live.
Full disclosure, I'm a monitoring engineer, not an SA - definitely not a bash scripter by nature!
Thanks in advance.
EDIT: just for clarity, an echo of the PIDs to the screen is NOT the desired end output; it's just a simple way to test that I'm getting back what I'm expecting. Based on a comment below, I believe pgrep-type output is the preferred output. In the end I'll be tying these PIDs back one at a time to the original process to ensure that it is the parent, and if it is not I'll spit out an error.
It's not so much that $i will be one concatenated number; rather, your array is just a single element containing that concatenated number. This is because the output of awk is concatenated together without any separator.
If you simply add a space within awk, you may get what you want:
rmanPPID=( $(ps -ef|grep processname | ... | awk '{ printf "%d ", $3 }') )
or even simpler, use print instead of printf:
rmanPPID=( $(ps -ef|grep processname | ... | awk '{ print $3 }') )
(Thanks to Jonathan Leffler, see comment below.)
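As a side note, since the edit above mentions pgrep-style output: a hedged sketch that collects the parent PIDs directly, skipping the ps|grep|egrep -v grep chain ("processname" is the question's placeholder):
# is the main process (the one with the config file) running at all?
if pgrep -f /etc/process/process.conf > /dev/null; then
    # ps -o ppid= prints just the parent PID of each process named "processname"
    mapfile -t rmanPPID < <(ps -o ppid= -C processname)
    for i in "${rmanPPID[@]}"; do
        echo "$i"
    done
fi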

How can I split bash CLI arguments into two separate arrays for later usage?

New to StackOverflow and new to bash scripting. I have a shell script that is attempting to do the following:
cd into a directory on a remote machine. Assume I have already established a successful SSH connection.
Save the email addresses from the command line input (these could range from 1 to X number of email addresses entered) into an array called 'emails'
Save the brand IDs (integers) from the command line input (these could range from 1 to X number of brand IDs entered) into an array called 'brands'
Use nested for loops to iterate over the 'emails' and 'brands' arrays and add each email address to each brand via add.py
I am running into trouble splitting up and saving data into each array, because I do not know where the command line indices of the emails will stop, and where the indices of the brands will begin. Is there any way I can accomplish this?
The command-line input I expect looks as follows:
me@some-remote-machine:~$ bash script.sh person1@gmail.com person2@gmail.com person3@gmail.com ... personX@gmail.com brand1 brand2 brand3 ... brandX
The contents of script.sh look like this:
#!/bin/bash
cd some/directory
emails= ???
brands= ???
for i in $emails
do
for a in $brands
do
python test.py add --email=$i --brand_id=$a --grant=manage
done
done
Thank you in advance, and please let me know if I can clarify or provide more information.
Use a sentinel argument that cannot possibly be a valid e-mail address. For example:
$ bash script.sh person1@gmail.com person2@gmail.com '***' brand1 brand2 brand3
Then in a loop, you can read arguments until you reach the non-email; everything after that is a brand.
#!/bin/bash
cd some/directory
while [[ $1 != '***' ]]; do
emails+=("$1")
shift
done
shift # Ignore the sentinel
brands=( "$#" ) # What's left
for i in "${emails[#]}"
do
for a in "${brands[#]}"
do
python test.py add --email="$i" --brand_id="$a" --grant=manage
done
done
If you can't modify the arguments that will be passed to script.sh, then perhaps you can distinguish between an address and a brand by the presence or absence of an @:
while [[ $1 = *@* ]]; do
emails+=("$1")
shift
done
brands=("$@")
I'm assuming that the number of addresses and brands are independent. Otherwise, you can simply look at the total number of arguments $#. Say there are N of each. Then
emails=( "${#:1:$#/2}" ) # First half
brands=( "${#:$#/2+1}" ) # Second half

Unique file names in a directory in unix

I have a capture file in a directory to which some logs are being written:
word.cap
Now there is a script that, when the file's size reaches exactly 1.6 GB, clears it and creates files in the below format in the same directory:
word.cap.COB2T_1389889231
word.cap.COB2T_1389958275
word.cap.COB2T_1390035286
word.cap.COB2T_1390132825
word.cap.COB2T_1390213719
Now I want to pick up all these files in a script, one by one, and perform some actions on them.
My script is:
today=`date +%d_%m_%y`
grep -E '^IPaddress|^Node' /var/rawcap/word.cap.COB2T* | awk '{print $3}' >> snmp$today.txt
sort -u snmp$today.txt > snmp_final_$today.txt
So, what should I write to pick up all file names of the above-mentioned format one by one? I will place this script in crontab, but I don't want to read the main word.cap file, as it is still being written to.
As per your comment:
Thanks, this is working, but I have a small issue with this. There are some files which are bzipped, i.e. word.cap.COB2T_1390213719.bz2, and I don't want these files in the list, so what should be done?
You could add a condition inside the loop:
for file in word.cap.COB2T*; do
if [[ "$file" != *.bz2 ]]; then
# Do something here
echo "$file"
fi
done
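Alternatively, the same filter can be expressed in a single glob, assuming bash's extglob option is acceptable:
shopt -s extglob
for file in word.cap.COB2T!(*.bz2); do
    echo "$file"
done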
