Concatenate 2 files from a list based on partial matching - loops

I have a list of files that need to be concatenated (or appended). I know how to do it one by one, but I'm not sure how to loop through them. Bonus points for using GNU Parallel to run them simultaneously.
To get the result manually, I would run:
cat exp-2-2_S1_L001_R1_001.fastq >> Exp-2-2_reseq_S1_L001_R1_001.fastq
But there are hundreds of files. Also, can this be done on compressed files with zcat?
The files that need to be concatenated or appended (in order) follow the same naming pattern as the pair in the command above.
Thanks in advance!!!
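For reference, a minimal sketch of one way to loop over such pairs in bash, assuming the matching source and destination names are listed side by side in a tab-separated file (pairs.tsv is a hypothetical name, not something from the question):
# pairs.tsv is assumed to hold one "source<TAB>destination" pair per line, e.g.
# exp-2-2_S1_L001_R1_001.fastq<TAB>Exp-2-2_reseq_S1_L001_R1_001.fastq
while IFS=$'\t' read -r src dest; do
    cat "$src" >> "$dest"
done < pairs.tsv
# For gzipped files, zcat "$src" >> "$dest" appends the decompressed data;
# alternatively, cat-ing .gz files together produces a multi-member gzip
# stream that zcat/gunzip read back as one file, so they can stay compressed.
# With GNU Parallel (fine as long as no two lines share the same destination):
parallel --colsep '\t' 'cat {1} >> {2}' :::: pairs.tsv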

Related

Why are Shake dependencies explicitly `needed`?

I find that the first example of Shake usage demonstrates a pattern that seems error-prone:
contents <- readFileLines $ out -<.> "txt"
need contents
cmd "tar -cf" [out] contents
Why do we need `need contents` when `readFileLines` reads them and `cmd` references them? Is this so we can avoid requiring ApplicativeDo?
I think part of the confusion may be the types/semantics of contents. The file out -<.> "txt" contains a list of filenames, so contents is a list of filenames. When we need contents we are requiring the files themselves be created and depended upon, using the filenames to specify which files. When we pass contents on to cmd we are passing the filenames which tar will use to query the files.
So the key point is that readFileLines doesn't read the files in question; it only reads the filenames out of another file. We have to use need to make sure those files have been built (and to record the dependency on them) before we actually use them in cmd. Another way of looking at the three lines is:
Which files do we want to operate on?
Make sure those files are ready.
Use those files.
Does that make sense? There's no relationship with ApplicativeDo - its presence wouldn't help us at all.

Copying multiple files to a WinSCP directory via script

I have a problem when trying to upload multiple files to one WinSCP directory. I can manage to copy just one single file, but I need to upload many files that are generated by a piece of software. The names are not fixed, so I need to use wildcards in order to copy all of them. I have tried many variants of the code, but all were unsuccessful. The code I am using is:
open "sftp://myserver:MyPass#sfts.us.myserver.com" -hostkey="hostkey"
put "C:\from*.*" "/Myserverfolder/Subfolder/"
exit
This code does actually copy the first alphabetically named file, but it ignores the rest of the files.
Any help with it would be much appreciated.
Try this in your script:
lcd C:\from
cd /Myserverfolder/Subfolder
put *
Try doing it all manually first so you can see what's going on.

Looping through all the inputs and creating distinct output files

I am using Cygwin on Windows 7. I have a folder with about 1200 files of the same type (there are no sub-directories). I am trying to go through the folder and perform a certain action (a bioinformatic alignment) on each file. Here is the code that I am using:
$ for file in Project/genomes_0208/*;
do ./bowtie-build $file ../bowtie-0.12.7/indexes/abc;
done
./bowtie-build is the operation that I want to perform. Now, this does complete the operation for all the files in the folder, but it keeps writing the output to the same file, abc in this case.
So in the end I have only one file, containing the latest output. How can I create 1200 different files, one for each input? It doesn't matter what the output files are named, as long as they are all different.
Hope I explained the problem successfully, I'd appreciate any help with this!
How about:
./bowtie-build $file "${file}.out"
If your files had unique names to begin with, this should also produce unique output files.
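Spelled out as a full loop, a sketch might look like this (the paths come from the question; deriving the output name from each input's basename and writing it into the indexes directory is an assumption):
for file in Project/genomes_0208/*; do
    name=$(basename "$file")                                   # strip the directory part
    ./bowtie-build "$file" "../bowtie-0.12.7/indexes/$name"    # one distinct index prefix per input
done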

Cat selected files fast and easy?

I have been cat'ing files in the Terminal until now, but that is time-consuming when done a lot. What I want is something like this:
I have a folder with hundreds of files, and I want to effectively cat a few files together.
For example, is there a way to select (in the Finder) four split files:
file.txt.001, file.txt.002, file.txt.003, file.txt.004
.. and then right click on them in the Finder, and just click Merge?
I know that isn't possible out of the box, of course, but is something like that possible with an Automator action, droplet, or shell script? Or maybe the cat action could be assigned a keyboard shortcut, so that when it is hit, the files selected in the Finder are automatically merged into a new file AND placed in the same folder, WITH a name based on the original split files?
In this example, file.001 through file.004 would magically turn into a file named fileMerged.txt in the same folder?
I have like a million of these kinds of split files, so an efficient workflow for this would be a lifesaver. I'm working on an interactive book, and the publisher gave me this task.
cat * > output.file
works as a sh script. It redirects the concatenated contents of the files into output.file.
* expands to all files in the directory.
Judging from your description of the file names, you can automate that very easily with bash, e.g.
PREFIXES=`ls -1 | grep -o "^.*\." | uniq`
for PREFIX in $PREFIXES; do cat ${PREFIX}* > ${PREFIX}.all; done
This will merge all files in one directory that share the same prefix.
ls -1 lists all files in the directory (if they span multiple directories, you can use find instead). grep -o "^.*\." matches everything up to the last dot in the file name (you could also use sed -e 's/\.[0-9]*$/./' to strip the trailing digits). uniq filters out the duplicates, so you end up with something like speech1.txt. sound1.txt. in the PREFIXES variable. The last line loops through those prefixes and merges each group of files using the * wildcard.
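If you want the right-click workflow from the question, a minimal sketch of the shell part is below, assuming an Automator workflow (a Service/Quick Action) containing a "Run Shell Script" step with input passed as arguments, so the files selected in the Finder arrive as "$@"; the fileMerged.txt naming is an assumption based on the example above:
dir=$(dirname "$1")                  # write the result next to the originals
base=$(basename "$1")                # e.g. file.txt.001
stem=${base%.*}                      # drop the numeric part -> file.txt
out="$dir/${stem%.txt}Merged.txt"    # -> fileMerged.txt (assumed naming)
: > "$out"                           # start with an empty output file
# sort the selected paths so .001, .002, ... are appended in order
printf '%s\n' "$@" | sort | while IFS= read -r f; do
    cat "$f" >> "$out"
done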

Hadoop job taking input files from multiple directories

I have a situation where I have multiple files (100+, of 2-3 MB each) in compressed gz format, present in multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz
I have to feed all these files into one Map job. From what I see, to use MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently put these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper, not Pig or Hadoop streaming.
Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit
FileInputFormat.addInputPaths() can take a comma-separated list of multiple files, with your Job (or JobConf) as the first argument, like:
FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz");
