qsub array job submission - arrays

I am currently trying to run an array job on the "big-computer" at my Uni.
I'm new to Unix and bash and I've been having a hard time getting this to work.
The folder setup is as follows:
model1
- model1.inp
- model1.num
model2
- model2.inp
- model2.num
startup.sh
runAModel.sh
modelArray.sh
Due to restrictions on how long I can run a single job, I was asked to break up my simulations. So I need to run each model 5 times over, each time the model reads the input file .inp and outputs another input file for the subsequent run.
The code below used to work until a week ago or so, but it doesn't seem to function anymore. I wonder if I messed something up in there.
I suspected the problem might be in the line qcmd="qsub -N $modelName -t 1:5 ../../modelArray.sh" of runAModel.sh and that I should replace 1:5 with 1-5, but that didn't seem to work.
I use qstat to check on my jobs, and where I would expect to see a list of 5 queued jobs I only see one.
I was given three files to run:
startup.sh :
find . -mindepth 2 -type d -exec ./runAModel.sh {} \;
runAModel.sh :
#!/bin/bash
echo starting model in $1
cd $1 # go into the model directory
modelName=$(basename $PWD)
for f in *
do
dos2unix $f
done
qcmd="qsub -N $modelName -t 1:5 ../../modelArray.sh"
qq=`$qcmd` # runs a qsub command
# extract the job number
qt=`echo $qq | awk '{print $3}'`
jobid=${qt%%.*}
qrls $jobid.1
and modelArray.sh :
#!/bin/bash
# run program, invoke in model directory with input files.
# we want to run in the current working directory
#$ -cwd
# we want to run mpi with 4 cores on the same node:
#$ -pe sharedmem 4
# make a generous guess at the time we need
#$ -l h_rt=30:00:00
# force reservation
#$ -R y
# use 4G per process
#$ -l h_vmem=4G
# hold the array
#$ -h
echo I am task $SGE_TASK_ID in $JOB_ID with $SGE_TASK_LAST tasks in total
echo on $HOSTNAME
date
# run our model - set modules, then get the model name
echo "set modules"
. /etc/profile.d/modules.sh
PROGRAMBUILD=/exports/programlocation
. $PROGRAMBUILD/loadModules.sh
modelName=$(basename $PWD)
echo mpirun -np 4 $PROGRAMBUILD/bin/program $modelName
mpirun -np 4 $PROGRAMBUILD/bin/program $modelName
if [ $SGE_TASK_ID == $SGE_TASK_LAST ]
then
echo I am last task
else
# release the next task....
# next task in this array:
next=$((SGE_TASK_ID+1))
echo insert a test that this task in the array job was successful
echo if so, release next task
echo releasing $next
ssh login01.***.uk qrls $JOB_ID.$next
if [[ "$?" -ne 0 ]]; then
echo failed to qrls $pid
fi
fi

Related

Bash: how to print and run a cmd array which has the pipe operator, |, in it

This is a follow-up to my question here: How to write bash function to print and run command when the command has arguments with spaces or things to be expanded
Suppose I have this function to print and run a command stored in an array:
# Print and run the cmd stored in the passed-in array
print_and_run() {
echo "Running cmd: $*"
# run the command by calling all elements of the command array at once
"$#"
}
This works fine:
cmd_array=(ls -a /)
print_and_run "${cmd_array[#]}"
But this does NOT work:
cmd_array=(ls -a / | grep "home")
print_and_run "${cmd_array[#]}"
Error: syntax error near unexpected token `|':
eRCaGuy_hello_world/bash$ ./print_and_run.sh
./print_and_run.sh: line 55: syntax error near unexpected token `|'
./print_and_run.sh: line 55: `cmd_array=(ls -a / | grep "home")'
How can I get this concept to work with the pipe operator (|) in the command?
If you want to treat an array element containing only | as an instruction to generate a pipeline, you can do that. I don't recommend it -- it means you have a security risk if you don't verify that variables expanded into your array can't consist of only a single pipe character -- but it's possible.
Below, we create a random single-use "$pipe" sigil to make that attack harder. If you're unwilling to do that, change [[ $arg = "$pipe" ]] to [[ $arg = "|" ]].
# generate something random to make an attacker's job harder
pipe=$(uuidgen)
# use that randomly-generated sigil in place of | in our array
cmd_array=(
ls -a /
"$pipe" grep "home"
)
exec_array_pipe() {
local arg cmd_q
local -a cmd=( )
while (( $# )); do
arg=$1; shift
if [[ $arg = "$pipe" ]]; then
# log an eval-safe copy of what we're about to run
printf -v cmd_q '%q ' "${cmd[@]}"
echo "Starting pipeline component: $cmd_q" >&2
# Recurse into a new copy of ourselves as a child process
"${cmd[#]}" | exec_array_pipe "$#"
return
fi
cmd+=( "$arg" )
done
printf -v cmd_q '%q ' "${cmd[@]}"
echo "Starting pipeline component: $cmd_q" >&2
"${cmd[@]}"
}
exec_array_pipe "${cmd_array[@]}"
See this running in an online sandbox at https://ideone.com/IWOTfO
Do this instead. It works.
print_and_run() {
echo "Running cmd: $1"
eval "$1"
}
Example usage:
cmd='ls -a / | grep -C 9999 --color=always "home"'
print_and_run "$cmd"
Output:
Running cmd: ls -a / | grep -C 9999 --color=always "home"
(rest of output here, with the word "home" highlighted in red)
The general answer is that you don't: you do not store the whole command line to be printed later, and this is not the direction you should take.
The "bad" solution is to use eval.
The "good" solution is to store the literal '|' character inside the array (or some better representation of it) and parse the array, extract the pipe parts and execute them. This is presented by Charles in the other amazing answer. It is just rewriting the parser that already exists in the shell. It requires significant work, and expanding it will require significant work.
The end result is that you are reimplementing parts of the shell inside the shell. Basically you are writing a shell interpreter in shell. At this point, you can just consider taking the Bash sources and implementing a new shopt -o print_the_command_before_executing option there, which might just be simpler.
However, I believe the end goal is to give users a way to see what is being executed. I would propose to approach it like .gitlab-ci.yml does with script: statements. If you want to invent your own language with "debug" support, do just that instead of half-measures. Consider the following YAML file:
- ls -a / | grep "home"
- echo other commands
- for i in "stuff"; do
echo "$i";
done
- |
for i in "stuff"; do
echo "$i"
done
Then the following "runner":
import yaml
import shlex
import os
import sys
script = []
input = yaml.safe_load(open(sys.argv[1], "r"))
for line in input:
script += [
"echo + " + shlex.quote(line).replace("\n", "<newline>"), # some unicode like ␤ would look nice
line,
]
os.execvp("bash", ["bash", "-c", "\n".join(script)])
Executing the runner results in:
+ ls -a / | grep "home"
home
+ echo other commands
other commands
+ for i in "stuff"; do echo "$i"; done
stuff
+ for i in "stuff"; do<newline> echo "$i"<newline>done<newline>
stuff
This offers greater flexibility, is rather simple, and supports any shell construct with ease. You can try gitlab-ci/cd on their repository and read the docs.
The YAML format is only an example of the input format. Using special comments like # --- cut --- between parts and extracting each part with the parser would allow running shellcheck over the script. Instead of generating a script with echo statements, you could run Bash interactively, print the part to be executed and then "feed" that part to the interactive Bash. This would allow preserving $? between parts.
Either way - with a "good" solution, you end up with a custom parser.
Instead of passing an array, you can pass the whole function and use the output of declare -f with some custom parsing:
print_and_run() {
echo "+ $(
declare -f "$1" |
# Remove `f() {` and `}`. Remove indentation.
sed '1d;2d;$d;s/^ *//' |
# Replace newlines with <newline>.
sed -z 's/\n*$//;s/\n/<newline>/'
)"
"$#"
}
cmd() { ls -a / | grep "home"; }
print_and_run cmd
Results in:
+ ls --color -F -a / | grep "home"
home/
It supports any shell construct, still allows you to check the function with shellcheck, and doesn't require that much work.

Randomly generating invoice IDs - moving text database into script file?

I've come up with the following bash script to randomly generate invoice numbers, preventing duplications by logging all generated numbers to a text file "database".
To my surprise the script actually works, and it seems robust (although I'd be glad to have any flaws pointed out to me at this early stage rather than later on).
What I'm now wondering is whether it's at all possible to move the "database" of generated numbers into the script file itself. This would allow me to rely on and keep track of just the one file rather than two separate ones.
Is this at all possible, and if so, how? If it isn't a good idea, what valid reasons are there not to do so?
#!/usr/bin/env bash
generate_num() {
#num=$(head /dev/urandom | tr -dc '[:digit:]' | cut -c 1-5) [Original method, no longer used]
num=$(shuf -i 10000-99999 -n 1)
}
read -p "Are you sure you want to generate a new invoice ID? [Y/n] " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]
then
generate_num && echo Generating a random invoice ID and checking it against the database...
sleep 2
while grep -xq "$num" "ID_database"
do
echo Invoice ID \#$num already exists in the database...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
while [[ ${#num} -gt 5 ]]
do
echo Invoice ID \#$num is more than 5 digits...
sleep 2
generate_num && echo Generating new random invoice ID and checking against database...
sleep 2
done
echo Generated random invoice ID \#$num
sleep 1
echo Invoice ID \#$num does not exist in database...
sleep 2
echo $num >> "ID_database" && echo Successfully added Invoice ID \#$num to the database.
else
echo "Exiting..."
fi
I do not recommend this because:
These things are fragile. One bad edit and your invoice database is corrupt.
It makes version control a pain. Each new version of the script should preferably be checked in. You could add logic to make sure that "$mydir" is an empty directory when you run the script (except for "$myname", .git and other git-related files), then run git -C "$mydir" init if "$mydir"/.git doesn't exist. Then for each database update, run git -C "$mydir" add "$myname" and git -C "$mydir" commit -m "$num". It's just an idea to explore (a small sketch follows this list)...
Locking - It's possible to do file locking to make sure that no two users run the script at the same time, but it adds to the complexity so I didn't bother. If you feel that's a risk, you need to add that.
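A minimal sketch of that git idea (hedged; it assumes git is available and that $mydir, $myname and $num are set as in the script further down):
# one-time setup: turn the script's directory into a git repository
if [[ ! -d "$mydir"/.git ]]; then
git -C "$mydir" init
fi
# after each database update, record the new state of the script
git -C "$mydir" add "$myname"
git -C "$mydir" commit -m "$num"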
... but you want a self-modifying script, so here goes.
This just adds a new invoice number to its internal database each time you run it. I've explained what goes on in the comments. The last line should read __INVOICES__ (+ a newline) if you copy the script.
As always when dealing with things like this, remember to make a backup before making changes :-)
As it's currently written, you can only add one invoice per run. It shouldn't be hard to move things around (you need a new tempfile) to get it to add more than one if you need that.
#!/bin/bash
set -e # exit on error - important for this type of script
#------------------------------------------------------------------------------
myname="$0"
mydir=$(dirname "$myname")
if [[ ! -w $myname ]]; then
echo "ERROR: You don't have permission to update $myname" >&2
exit 1
fi
# create a tempfile to be able to update the database in the file later
#
# set -e makes the script end if this fails:
temp=$(mktemp -p "$mydir")
trap "{ rm -f "$temp"; }" EXIT # remove the tempfile if we die for some reason
# read current database from the file
readarray -t ID_database <<< $(sed '0,/^__INVOICES__$/d' "$0")
#declare -p ID_database >&2 # debug
#------------------------------------------------------------------------------
# a function to check if a number is already in the db
is_it_taken() {
local num=$1
# return 1 (true, yes it's taken) if the regex found a match
[[ ! " ${ID_database[#]} " =~ " ${num} " ]]
}
generate_num() {
local num
(exit 1) # set $? to 1
# loop until $? becomes 0
while (( $? )); do
num=$(shuf -i 10000-99999 -n 1)
is_it_taken "$num"
done
# we found a free number
echo $num
}
add_to_db() {
local num=$1
# add to db in memory
ID_database+=($num)
# add to db in file:
# copy the script to the tempfile
cp -pf "$myname" "$temp"
# add the new number
echo $num >> "$temp"
# move the tempfile into place
mv "$temp" "$myname"
}
#------------------------------------------------------------------------------
num=$(generate_num)
add_to_db $num
# your business logic goes here:
echo "All current invoices:"
for invoice in ${ID_database[@]}
do
echo ">$invoice<"
done
#------------------------------------------------------------------------------
# leave the rest untouched:
exit
__INVOICES__
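For illustration, after a few runs the bottom of the file might look like this (the numbers here are made up):
exit
__INVOICES__
48214
90177
13566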
Edited
To answer the question you asked -
Make sure your file ends with an explicit exit statement.
Without some sort of branching it won't execute past that, so unless there is a gross parsing error anything below could be used as storage space. Just
echo $num >> $0
If you write your records directly onto the bottom of the script, the script grows, but relatively harmlessly. Just make sure your grep pattern doesn't grab any lines of code, though grep -E '^[0-9]{5}$' seems pretty safe.
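As a rough sketch of that idea (the 5-digit range and shuf call are carried over from the question; this is just the bare append-below-exit mechanism, not the full script above):
#!/usr/bin/env bash
# pick a random 5-digit ID that is not already stored below the exit marker
num=$(shuf -i 10000-99999 -n 1)
while grep -Eq "^${num}$" "$0"; do
num=$(shuf -i 10000-99999 -n 1)
done
# append it to this very file; only lines after the exit are ever data
echo "$num" >> "$0"
echo "Generated invoice ID #$num"
exit
# stored IDs follow below this line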
This is only ever going to give you a maximum of ~90k IDs, and it spends unneeded time and cycles on redundancy checking. Is there a limit on the length of the value?
If you can assure there won't be more than one invoice processed per second,
date +%s >> "ID_database" # the UNIX epoch, seconds since 00:00:00 01/01/1970
If you need more precision than that,
date +%Y%m%d%H%M%S%N
will output Year month day hour minute second nanoseconds, which is both immediate and "pretty safe".
date +%s%N # epoch with nanoseconds
is shorter, but doesn't have the convenient side effect of automatically giving you the date and time of invoice creation.
If you absolutely need to guarantee uniqueness and nanoseconds isn't good enough, use a lock of some sort, and maybe a more fine-grained language.
On the other hand, if minutes are unique enough, you could use
date +%y%m%d%H%M
You get the idea.

Bash manipulate and sort file content with arrays via loop

Purpose
Create a bash script that loops through certain commands and saves the output of each command (they print only numbers) into a file (I guess saving them in a file is the best way?), together with the Unix time of each output, so that the next time the script runs and loops through again, it can tell whether any command's output has not changed within the last hour.
Example output
# ./script
command1 123123
command2 123123
Important notes
There are around 200 commands which the script will loop through.
There'll be new commands in the future, so the script will have to check whether each command already exists in the saved file. If it is already present, only compare within the last hour to see whether the number has changed since the file was last saved. If it doesn't exist, save it into the file so we can use it for comparison next time.
The order in which the script runs the commands might differ as commands are added, removed or changed. So if it's only like this for now:
# ./script
command1 123123
command2 123123
and you add a 3rd command in the future, the order might change (it is also not certain what kind of pattern it follows), for example:
# ./script
command1 123123
command3 123123
command2 123123
so we can't, for example, just read it line by line; in this case, I believe the best way is to match entries by their command* names.
Structure for stored values
My presumed structure for the stored values is like this (we don't have to stick with this one, though):
command1 123123 unixtime
command2 123123 unixtime
About the said commands
The things I call commands are basically applications which live in /usr/local/bin/ and can be accessed by running their names directly in the shell, like command1 getnumber, which will print the number.
Since the commands are located in /usr/local/bin/ and follow a similar pattern, I first loop through /usr/local/bin/ looking for command*. See below.
commands=`find /usr/local/bin/ -name 'command*'`
for i in $commands; do
echo "$i" "`$i getnumber`"
done
so this loops through all files that start with command and runs command* getnumber for each one, which prints out the numbers we need.
Now we need to store these values in a file to compare them next time we run the command.
Catch:
We may even run the script every few minutes, but we only need to report when a value (number) hasn't changed in the last hour.
The script will list the numbers every time you run it, and we may style the ones that haven't changed in the last hour to make them stand out, maybe by colouring them red?
Attempt #1
So this is my first attempt at building this script. Here's what it looks like:
#!/bin/bash
commands=`find /usr/local/bin/ -name 'command*'`
date=`date +%s`
while read -r command number unixtime; do
for i in $commands; do
current_block_count=`$i getnumber`
if [[ $command = $i ]]; then
echo "$i exists in the file, checking the number changes within last hour" # just for debugging, will be removed in production
if (( ($date-$unixtime)/60000 > 60 )); then
if (( $number >= $current_number_count )); then
echo "There isn't a change within the last hour, this is a problem!" # just for debugging, will be removed in production
echo -e "$i" "`$i getnumber`" "/" "$number" "\e[31m< No change within last hour."
else
echo "$i" "`$i getnumber`"
echo "There's a change within the last hour, we're good." # just for debugging, will be removed in production
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
new_output=`$i getnumber`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i" "`$i getnumber`"
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
output_check="$i getnumber; date +%s"
new_output=`eval ${output_check}`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i does not exists in the file, adding it now" # just for debugging, will be removed in production
echo "$i" "`$i getnumber`" "`date +%s`" >> outputs.log
fi
done
done < outputs.log
This was quite the disaster; eventually, it did nothing when I ran it.
Attempt #2
This time, I tried another approach, wrapping the for loop around the while loop.
#!/bin/bash
commands=`find /usr/local/bin/ -name 'command*'`
date=`date +%s`
for i in $commands; do
echo "${i}" "`$i getnumber`"
name=${i}
number=`$i getnumber`
unixtime=$date
echo "$name" "$number" "$unixtime" # just for debugging, will be removed in production
while read -r command number unixtime; do
if ! [ -z ${name+x} ]; then
echo "$name" "$number" "$unix" >> outputs.log
else
if [[ $name = $i ]]; then
if (( ($date-$unixtime)/60000 > 60 )); then
if (( $number >= $current_number_count )); then
echo "There isn't a change within the last hour, this is a problem!" # just for debugging, will be removed in production
echo -e "$i" "`$i getnumber`" "/" "$number" "\e[31m< No change within last hour."
else
echo "$i" "`$i getnumber`"
echo "There's a change within the last hour, we're good." # just for debugging, will be removed in production
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
new_output=`$i getnumber`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i" "`$i getnumber`"
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
output_check="$i getnumber; date +%s"
new_output=`eval ${output_check}`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i does not exists in the file, adding it now" # just for debugging, will be removed in production
echo "$i" "`$i getnumber`" "`date +%s`" >> outputs.log
fi
fi
done < outputs.log
done
Unfortunately, no luck for me, again.
Can someone give me a helping hand?
Additional notes #2
So basically, the first time you run the script, outputs.log is empty, so you write the outputs of the commands into outputs.log.
Then, say 10 minutes later, you run the script again; since only 10 minutes have passed and not more than an hour, the script won't check whether the numbers have changed. It will not touch the stored values, but it will still display the current output of each command every time you run it (the present outputs, not the stored values).
In this 10-minute timeframe, for example, new commands might have been added, so on every run the script also checks whether each command's output is already stored, just to deal with new commands.
Now let's say 1.2 hours have passed and you run the script again; this time the script will check whether the numbers have changed within the past hour and report: Hey! More than an hour has passed and those numbers still haven't changed, there might be a problem!
Simple explanation
You have 100 commands to run; the script will loop through each of them and do the following:
Run the script whenever you want
On each run, check if outputs.log contains the command
If outputs.log contains the command, check its last stored date ($unixtime).
If the last stored date is more than an hour old, compare the number from the current run with the stored value.
If the number hasn't changed for more than an hour, print that command's line in red.
If the number has changed, print it as usual without any warning.
If the last stored date is less than an hour old, print it as usual.
If outputs.log doesn't contain the command, simply store it in the file so it can be checked on the next runs.
The following uses a sqlite database to store results, instead of a flat file, which makes querying the history of previous runs easy:
#!/bin/sh
database=tracker.db
if [ ! -e "$database" ]; then
sqlite3 -batch "$database" <<EOF
CREATE TABLE IF NOT EXISTS outputs(command TEXT
, output INTEGER
, ts INTEGER NOT NULL DEFAULT (strftime('%s', 'now')));
CREATE INDEX IF NOT EXISTS outputs_idx ON outputs(command, ts);
EOF
fi
for cmd in /usr/local/bin/command*; do
f=$(basename "$cmd")
o=$("$cmd")
echo "$f $o"
sqlite3 -batch "$database" <<EOF
INSERT INTO outputs(command, output) VALUES ('$f', $o);
SELECT command || ' has unchanged output!'
FROM outputs
WHERE command = '$f' AND ts >= strftime('%s', 'now', '-1 hour')
GROUP BY command
HAVING count(DISTINCT output) = 1 AND count(*) > 1;
EOF
done
It lists commands that have had every run in the last hour produce the same output (and skips commands that have only run once). If instead you're interested in cases where the most recent output of each command is the same as the previous run in that hour timeframe, replace the sqlite3 invocation in the loop with:
sqlite3 -batch $database <<EOF
INSERT INTO outputs(command, output) VALUES ('$f', $o);
WITH prevs AS
(SELECT command
, output
, row_number() OVER w AS rn
, lead(output, 1) OVER w AS prev
FROM outputs
WHERE command = '$f' AND ts >= strftime('%s', 'now', '-1 hour')
WINDOW w AS (ORDER BY ts DESC))
SELECT command || ' has unchanged output!'
FROM prevs
WHERE output = prev AND rn = 1;
EOF
(This requires the sqlite3 shell from release 3.25 or newer because it uses features introduced then.)
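If you would rather keep the flat "command number unixtime" layout from the question instead of sqlite, a rough sketch could look like the following (not production code; the getnumber argument and the outputs.log name are taken from the question):
#!/bin/bash
log=outputs.log
now=$(date +%s)
touch "$log"
for cmd in /usr/local/bin/command*; do
name=$(basename "$cmd")
number=$("$cmd" getnumber)
stored=$(grep -m1 "^$name " "$log")
if [[ -z $stored ]]; then
# new command: store it and move on
echo "$name $number $now" >> "$log"
echo "$name $number"
continue
fi
read -r _ old_number old_time <<< "$stored"
if (( number != old_number )); then
# output changed: remember the new value and the time of the change
sed -i "s|^$name .*|$name $number $now|" "$log"
echo "$name $number"
elif (( now - old_time > 3600 )); then
# unchanged for more than an hour: highlight in red
printf '\e[31m%s %s < no change within last hour\e[0m\n' "$name" "$number"
else
echo "$name $number"
fi
done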

importing data from a CSV in Bash

I have a CSV file that I need to use in a bash script. The CSV is formatted like so.
server1,file.name
server1,otherfile.name
server2,file.name
server3,file.name
I need to be able to pull this information into an array or some other structure so that I can filter it, pull out only the data for a single server, and pass that to another command within the script.
I need it to go something like this.
Import workfile.csv
check hostname | return only lines from workfile.csv that have the hostname as column one and store column 2 as a variable.
find / -xdev -type f -perm -002 | compare to stored info | chmod o-w all files not in listing
I'm stuck using bash because of the environment that I'm working in.
The CSV can be too big to add all the filenames to the find parameter list.
You also do not want to call find in a loop for every line in the csv.
Solution:
First, make a complete list of files in a tmp file.
Second, parse the CSV and filter its files out of that list.
Third, chmod -w the files that remain.
The next solution stores the files in a tmp file.
Make a script that gets the servername as a parameter.
See the comments in the code:
# Before EDIT:
# Hostname by parameter 1
# Check that you have a hostname
if [ $# -ne 1 ]; then
echo "Usage: $0 hostname"
# Exit script, failure
exit 1
fi
hostname=$1
# Edit, get hostname by system call
hostname=$(hostname)
# Or: hostname=$(hostname -s)
# Additional check
if [ ! -f workfile.csv ]; then
echo "inputfile missing"
exit 1
fi
# After edits, ${hostname} is now filled.
find / -xdev -type f -perm -002 > /tmp/allfiles.tmp
# Do not use cat workfile.csv | grep ..., you do not need to call cat
# grep with ^ for beginning of line, add a , for a complete first field
# grep "^${hostname}," workfile.csv
# cut for selecting second field with delimiter ','
# cut -d"," -f2
# while read file => can be improved with xargs but lets start with this.
grep "^${hostname}," workfile.csv | cut -d"," -f2 | while read file; do
# Using sed with # as delimiter, not /, since you need / in the search string
# (an address with an alternative delimiter needs a leading backslash: \#regexp#d)
# Variable in sed must be outside the single quotes and in double quotes
# Add $ after the file for end-of-line
# delete the line with the file
sed -i '\#/'"${file}"'$#d' /tmp/allfiles.tmp
done
echo "Review /tmp/allfiles.tmp before chmodding all these files"
echo "Delete the echo and exit when you are happy"
# Just an exit for testing
exit
# Using < is for avoiding a call to cat
</tmp/allfiles.tmp xargs chmod -w
It might be easier to chmod -w all the files and then chmod +w the files in the CSV. This is a little different from what you asked, since all the files from the CSV end up writable after this process; maybe you do not want that. A rough sketch of this alternative follows.
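As a sketch of that alternative (reusing /tmp/allfiles.tmp from above; note that the CSV file names are used as unescaped regex suffixes here, and xargs -r is GNU xargs):
# remove write permission from every file in the list first
</tmp/allfiles.tmp xargs chmod -w
# then add it back for the files listed for this host in workfile.csv
grep "^${hostname}," workfile.csv | cut -d"," -f2 | while read -r file; do
grep "/${file}\$" /tmp/allfiles.tmp | xargs -r chmod +w
done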

SGE array jobs and R

I currently have an R script written to perform a population genetic simulation, then write a table with my results to a text file. I would like to somehow run multiple instances of this script in parallel using an array job (my University's cluster uses SGE), and when it's all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).
I spent the better part of the afternoon reading and trying to figure out how to do this, but haven't really found anything along the lines of what I am trying to do. I was wondering if someone could provide an example or perhaps point me in the direction of something I could read to help with this.
To boil down mithrado's answer to the bare essentials:
Create a job script, pop_gen.bash, that may or may not take the SGE task ID as an input argument, storing results in a specific file identified by that same task ID:
#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
Submit this script as a job array, e.g. 1000 jobs:
qsub -t 1-1000 pop_gen.bash
Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.
Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R you can use it to name the output file:
args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)
HTH
I am not used to doing this in R, but I've been using the same approach in Python. Imagine that you have a script genetic_simulation.r and it has 3 parameters:
--gene_id, --khmer_len and --output_file.
You will have one CSV file, genetic_sim_parms.csv, with n rows:
first_gene,10,/result/first_gene.txt
...
nth_gene,6,/result/nth_gene.txt
An important detail is the first line of your genetic_simulation.r. It needs to tell the cluster which interpreter to use. You might need to tweak its path as well; depending on your setup, it will look like:
#!/path/to/Rscript --vanilla
And finally, you will need a array-job bash script:
#!/bin/bash
#$ -t 1-N < change N to the number of rows in genetic_sim_parms.csv
#$ -N genetic_simulation.r
echo "Starting on : $(date)"
echo "Running on node : $(hostname)"
echo "Current directory : $(pwd)"
echo "Current job ID : $JOB_ID"
echo "Current job name : $JOB_NAME"
echo "Task index number : $SGE_TASK_ID"
ID=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $1}' genetic_sim_parms.csv)
LEN=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $2}' genetic_sim_parms.csv)
OUTPUT=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $3}' genetic_sim_parms.csv)
echo "id is: $ID"
Rscript genetic_simulation.r --gene_id $ID --khmer_len $LEN --output_file $OUTPUT
echo "Finished on : $(date)"
Hope this helps!
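For what it's worth, one way to avoid hard-coding N is to count the parameter rows and pass -t on the command line, which on SGE takes precedence over the embedded directive (array_job.sh here is just a placeholder name for the script above):
# count the parameter rows and submit that many tasks
N=$(wc -l < genetic_sim_parms.csv)
qsub -t 1-"$N" array_job.sh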
