I would like some help creating a shell script that will let me submit an array job where each individual task has multiple input files. Here is an example of how I run array jobs that have one input file per task:
DIR=/WhereMyFilesAre
LIST=($DIR/*fastq)               # files I want to process
INDEX=$((SGE_TASK_ID-1))         # SGE task IDs start at 1, array indices at 0
INPUT_FILE=${LIST[$INDEX]}
bwa aln ${DIR}/referencegenome.fasta "$INPUT_FILE" > "${INPUT_FILE%.fastq}.sai"
Basically, I want to do something similar, except with two or more lists of files instead of one, where the files need to be paired properly. For instance, if I had File1_A.txt, File1_B.txt, File2_A.txt, and File2_B.txt, and a command that looked generically like
program input1 input2 > output
I would want the resulting jobs to have lines that look like
program File1_A.txt File1_B.txt > File1.txt
program File2_A.txt File2_B.txt > File2.txt
As you describe it, if the two input files follow a fixed naming scheme that differs only in the index, then just use SGE_TASK_ID as the index in your job script:
program File${SGE_TASK_ID}_A.txt File${SGE_TASK_ID}_B.txt > File${SGE_TASK_ID}.txt
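If the file names are less regular than that, one option is to glob just one of the lists and derive the paired file and the output name from each entry. A minimal sketch, assuming the _A.txt/_B.txt suffixes from your example and that every _A file has a matching _B file:
DIR=/WhereMyFilesAre
LIST_A=($DIR/*_A.txt)            # glob expansion is sorted, so the order is stable
INDEX=$((SGE_TASK_ID-1))
INPUT_A=${LIST_A[$INDEX]}
INPUT_B=${INPUT_A%_A.txt}_B.txt  # paired file: same prefix, _B suffix
OUTPUT=${INPUT_A%_A.txt}.txt
program "$INPUT_A" "$INPUT_B" > "$OUTPUT"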
I am trying to run a TreSpEx analysis on a series of trees, which are saved in Newick format as .fasta.txt files in a folder.
I have a list of Taxa names saved in a .txt file
I enter:
perl TreSpEx.v1.pl -fun e -ipt *fasta.txt -tf Taxa_List.txt
But it won't run. I tried writing a loop over each file in the folder, but I am not very good with loops, and my line of
for i in treefile/; do perl TreSpEx.v1.1.pl -fun e -ipt *.fasta.txt -tf Taxa_List.txt; done
won't work, because -ipt apparently needs a name that starts with a letter or number.
In your second example you are actually doing the same thing as in the first (though possibly several times), because *.fasta.txt is still passed to -ipt as a whole rather than one file at a time.
I'm not familiar with TreSpEx, nor do I know Bash very well for that matter (which it seems you are using), but you might try something like the following.
for i in treefile/*.fasta.txt ; do
perl TreSpEx.v1.1.pl -fun e -ipt "$i" -tf Taxa_List.txt;
done
Basically, you need to use the loop variable (i) to pass the name of each file to the command.
I need to write a PowerShell script that lets the user pass, as a parameter, a txt file containing the standard information you'd get from a
Get-Process > proc.txt
statement, and then compare the processes in the file with the currently running ones. I then need to display the Id, Name, start time and running time of every process that isn't in the txt file and is therefore a new process.
To give you a general idea of how I would approach this: I would
Extract only the names of the processes from the txt file into a variable (v1).
Save only the names of all the currently running processes in a variable (v2).
Compare the two variables (v1, v2) and write the processes that are not in the txt file (the new ones) into yet another variable (v3).
Get the process ID, the start time and the running time for each process name in v3 and output all of that (including the name) to the console and to a new file.
First of all, how can I read only the names of the processes from the txt file? I tried to find this on the internet but had no luck.
Secondly, how can I save only the new processes in a variable, and not all the differences (e.g. processes that are in the file but currently not running)?
As far as I know, Compare-Object returns all the differences.
Thirdly, how can I get the remaining process information I want for all the process names in v3?
And finally, how can I then neatly combine the ID, start time, running time and names from v3 in one file?
I'm pretty much a beginner at PowerShell programming; my four-step approach above is most likely wrong, so I appreciate any help I can get.
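For what it's worth, here is a rough sketch of that four-step approach. It assumes the txt file was produced by the plain Get-Process > proc.txt shown above (so the first few lines are a blank line plus the table header) and that process names contain no spaces; the file names proc.txt and new_processes.txt are just placeholders.
# Step 1: pull only the ProcessName column out of the saved table output
$v1 = Get-Content proc.txt |
      Select-Object -Skip 3 |                       # skip blank line, header, underline
      Where-Object { $_.Trim() } |                  # drop empty lines
      ForEach-Object { ($_ -split '\s+')[-1] } |    # last column is ProcessName
      Sort-Object -Unique
# Step 2: names of the currently running processes
$v2 = Get-Process | Select-Object -ExpandProperty Name | Sort-Object -Unique
# Step 3: keep only the names that are running now but absent from the file
$v3 = Compare-Object -ReferenceObject $v1 -DifferenceObject $v2 |
      Where-Object { $_.SideIndicator -eq '=>' } |
      Select-Object -ExpandProperty InputObject
# Step 4: collect Id, Name, start time and running time, then show and save them
if ($v3) {
    $new = Get-Process -Name $v3 |
           Select-Object Id, Name, StartTime,
                         @{ Name = 'RunTime'; Expression = { (Get-Date) - $_.StartTime } }
    $new | Format-Table -AutoSize
    $new | Out-File new_processes.txt
}
Note that StartTime may be inaccessible for some system processes unless the shell is elevated.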
Assume we have two files named file1 and file2.
File1:
a=b
c=d
e=f
File2:
a=p
c=o
e=f
g=h
i=j
Here file2 has the same keys as file1 but with different values, apart from some extra key-value pairs of its own.
I want to compare the keys of the two files and, wherever a key matches, replace the value in file2 with the value from file1, while retaining the entries that exist only in file2.
So, my final output should be:
File2:
a=b
c=d
e=f
g=h
i=j
Thanks In Advance.
The quickest way without using a script is the tool called "meld".
I can give one way of approaching the problem (though not the best):
1. Read the first file line by line.
2. Split each line on the "=" character.
3. Store the two parts as key and value, and build an array of all the key-value pairs.
4. Read the second file and repeat the procedure.
5. Compare the two arrays and keep only the values that are not in the first array.
In this specific case you can use the "cut" command in the shell to choose the fields.
I personally prefer a Perl script for file operations like this :)
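For completeness, the whole merge can also be done with a single awk command. This is only a sketch, and it assumes that keys and values contain no "=" characters and that file1 has no duplicate keys:
# read file1 first and remember its values; then print file2,
# substituting the value whenever the key is also present in file1
awk -F'=' -v OFS='=' '
    NR == FNR  { v[$1] = $2; next }   # first file: store key -> value
    ($1 in v)  { $2 = v[$1] }         # second file: override matching keys
    { print }
' file1 file2 > file2.new && mv file2.new file2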
I have a script that runs in AIX ksh that looks like this:
wc dir1/* dir2/* | {awk command to rearrange output} | {grep command to filter more} > dir2/output.txt
It is a precondition to this line that dir2/output.txt does not exist.
The issue is that dir2/output.txt has sometimes listed itself in the output (it has happened a handful of times out of hundreds of otherwise problem-free runs). dir1 and dir2 are NFS-mounted.
Is it related to the implementation of wc -- what if the first parameter takes a long time? I think not, as I've tried the following:
wc `sleep 5` *.txt > out.txt
Even in this case out.txt does not list itself.
As a last note, the wildcards in this example appear exactly where they are used in the actual script. So if the expansion happens first, why does this problem occur?
At what point is dir2/output.txt actually created?
Redirections are done by the shell, as are globs. Your problem is that, in the case of a pipeline, each pipeline stage is a separate subprocess; whether the shell subprocess that does the final redirection runs before the one that builds the glob of input files for wc will depend on details of the scheduler and system load, among other things, and should be considered indeterminate.
In short, you should assume that this will happen and either exclude dir2/output.txt (take a look at ksh extended glob patterns; in particular, something along the lines of dir2/!(output.txt) may be useful) or create the output somewhere else and mv it to its final location afterward.
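To illustrate, either workaround might look roughly like this in ksh; the awk and grep stages are left as placeholders, as in the original line:
# Option 1: exclude the output file from the glob with a ksh extended pattern
wc dir1/* dir2/!(output.txt) | awk '...' | grep '...' > dir2/output.txt
# Option 2: write somewhere outside the globbed directories, then move into place
wc dir1/* dir2/* | awk '...' | grep '...' > /tmp/output.$$ &&
    mv /tmp/output.$$ dir2/output.txt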
Hi: Is there a way to create a file which, when read, is generated dynamically?
I want to create three versions of the same file (one with 10 lines, one with 100 lines, and one with all of the lines). I don't see any need for these to be static; it would be best if they were proxies for a head/tail/cat command.
The purpose of this is unit testing: I want a unit test to run on a small portion of the full input file used in production. However, since the code only runs on full files (it's actually a Hadoop map/reduce application), I want to provide a truncated version of the whole data set without duplicating information.
UPDATE: An Example
more myActualFile.txt
1
2
3
4
5
more myProxyFile2.txt
1
2
more myProxyFile4.txt
1
2
3
4
etc. So the proxy files are differently named files whose content is provided dynamically by simply taking the first n lines of the main file.
This is hacky, but... one way is to use named pipes, with a looping shell script that generates the content (one script per named pipe). The script would look something like this:
while true; do
  (
    for i in $(seq "$linenr"); do echo something; done
  ) > thenamedpipe
done
Your script would then read from that named pipe.
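Concretely, a sketch for the example above, using mkfifo and head (the file names are the ones from your example, and the loops would normally run in the background):
mkfifo myProxyFile2.txt myProxyFile4.txt      # create the named pipes
# each time a reader opens a pipe, serve it the first n lines of the real file
while true; do head -n 2 myActualFile.txt > myProxyFile2.txt; done &
while true; do head -n 4 myActualFile.txt > myProxyFile4.txt; done &
cat myProxyFile2.txt                          # prints 1 and 2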
Another solution, if you are ready to dig into low-level stuff, is FUSE.