How to cat similar named sequence files from different directories into single large fasta file - loops

I am trying to get the following done. I have circa 40 directories of different species, each with 100s of sequence files that contain orthologous sequences. The sequence files are similarly named for each of the species directories. I want to concatenate the identically named files of the 40 species directories into a single sequence file which is named similarly.
My data looks as follows, e.g.:
directories: Species1 Species2 Species3
Within directory (similar for all): sequenceA.fasta sequenceB.fasta sequenceC.fasta
I want to get single files named: sequenceA.fasta sequenceB.fasta sequenceC.fasta
where the content of the different files from the different species is concatenated.
I tried to solve this with a loop (but this never ends well with me!):
ls . | while read FILE; do cat ./*/"$FILE" >> ./final/"$FILE"; done
This resulted in empty files and errors. I did try to find a solution elsewhere, e.g.: (https://www.unix.com/unix-for-dummies-questions-and-answers/249952-cat-multiple-files-according-file-name.html, https://unix.stackexchange.com/questions/424204/how-to-combine-multiple-files-with-similar-names-in-different-folders-by-using-u) but I have been unable to edit them to my case.
Could anyone give me some help here? Thanks!

In a root directory where your species directories reside, you should run the following:
$ mkdir output
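$ find Species* -type f -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;
It traverses all the files recursively and merges the contents of files with an identical basename into one file under the output directory. The file name is passed to sh as "$1" rather than pasted into the command string, so names containing spaces or shell metacharacters are handled safely.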
EDIT: even though this was an accepted answer, in a comment the OP mentioned that the real directories don't match a common pattern Species* as shown in the original question. In this case you can use this:
$ find . -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;
This way, we don't specify a search pattern but instead explicitly exclude the output directory, so that already merged data is not read again.
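For comparison, the same merge can be sketched as a plain shell loop over a glob, which is closer to what the question attempted. This is only a sketch: the function name merge_fasta is made up for illustration, and it assumes the species directories sit directly in the current directory and that the output directory is given as a plain name relative to it.

```shell
# merge_fasta OUTDIR  (hypothetical helper, not part of the answer above)
# Appends every ./<dir>/<name>.fasta to OUTDIR/<name>.fasta.
merge_fasta() {
    out=$1
    mkdir -p "$out"
    for f in ./*/*.fasta; do
        [ -f "$f" ] || continue                    # glob matched nothing
        case $f in "./$out/"*) continue ;; esac    # skip files already merged
        cat "$f" >> "$out/$(basename "$f")"
    done
}
```

Running merge_fasta output in the root directory should leave one merged file per sequence name in output/. Because it appends, running it twice duplicates the content, so remove the output directory before a rerun.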

Related

find and rename multiple files on multiple folder

I am looking for a way to rename multiple files in multiple folders.
For example, I have a file called "jobsforms.html.bak" in several folders:
/home/sites/juk/jobsforms.html.bak
/home/sites/juan/jobsforms.html.bak
/home/sites/pedro/jobsforms.html.bak
/home/sites/luois/jobsforms.html.bak
I want to rename all the files found to "jobsforms.html". How can I do that?
I was trying this approach:
find /home/sites -name "jobsform.html.bak" -exec bash -c 'mv "$1" "${1%/*}"/jobsform.html' -- {} \;
Can anyone help me with how to do this?
Thank you,
David
You could pipe the output of find to awk, using the sub function to remove a substring from the filename:
find /home/sites -name "jobsforms.html.bak" | awk '{ori=$0; sub(/\.bak$/,"",$0); system("mv \"" ori "\" \"" $0 "\"")}'
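The rename can also be done without awk by letting the shell strip the suffix. The following is a minimal sketch, assuming the files are all named jobsforms.html.bak as in the question; the function name strip_bak is hypothetical.

```shell
# strip_bak DIR  (hypothetical helper)
# Renames every jobsforms.html.bak under DIR to jobsforms.html.
strip_bak() {
    find "$1" -type f -name 'jobsforms.html.bak' -exec sh -c '
        for f do
            mv -- "$f" "${f%.bak}"    # ${f%.bak} removes the trailing .bak
        done' sh {} +
}
```

It would be called as strip_bak /home/sites. Passing the paths as arguments to sh keeps directory names with spaces intact.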

Script for renaming files and directories with special characters

I am looking for a script to rename files and directories that have special characters in them.
My files:
?rip?ev <- Directory
- Juhendid ?rip?evaks.doc <- Document
- ?rip?ev 2 <- Subdirectory
-- t?ts?.xml <- Subdirectory file
They need to be like this:
ripev <- Directory
- Juhendid ripevaks.doc <- Document
- ripev 2 <- Subdirectory
-- tts.xml <- Subdirectory file
I need to change the files and the folders so that the file type stays the same; for example, .doc and .xml must not be lost. Last time I did it with rename, every file type was lost and the files were moved to the parent directory (in this case the ?rip?ev directory), leaving the subdirectories empty. Everything is located under the parent directory /home/samba/.
So in this case I just need to remove the question mark from the file and directory names, without moving anything elsewhere or losing any other character or the file type. I have been looking around Google for an answer but haven't found one. I know it can be done with find and rename, but I haven't been able to overcome the complexity of the script. Can anyone help me please?
You can just do something like this
find -name '*\?*' -exec bash -c 'echo mv -iv "$0" "${0//\?/}"' {} \;
Note the echo before the mv so you can see what it does before actually changing anything. Otherwise above:
searches for ? in the name (? is equivalent to a single char version of * so needs to be escaped)
executes a bash command passing the {} as the first argument (since there is no script name it's $0 instead of $1)
${0//\?/} performs parameter expansion in bash replacing all occurrences of ? with nothing.
Note also that file types do not depend on the name in linux, but this should not change any file extension unless they contain ?.
Also this will rename symlinks as well if they contain ? (not clear whether or not that was expected from question).
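One caveat: the substitution above is applied to the whole path, so it may fail when a directory name itself contains ?, as in the question. A possible variant, sketched here under the assumption that only the last path component should change, walks the tree depth-first and rewrites just the basename. The function name strip_qmarks is hypothetical, and tr is used instead of the bash expansion so the sketch also runs under plain sh.

```shell
# strip_qmarks DIR  (hypothetical helper)
# Removes literal "?" characters from file and directory names under DIR.
strip_qmarks() {
    # -depth visits a directory's contents before the directory itself,
    # so a parent is renamed only after everything inside it is done.
    find "$1" -depth -name '*\?*' -exec sh -c '
        for path do
            dir=$(dirname "$path")
            base=$(basename "$path")
            new=$(printf "%s" "$base" | tr -d "?")
            [ -n "$new" ] || continue      # name was only "?" characters
            mv -i -- "$path" "$dir/$new"
        done' sh {} +
}
```

As with the one-liner, it is worth putting echo in front of mv for a dry run first.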
I usually do this kind of thing in Perl:
#!/usr/bin/perl
sub procdir {
chdir $_[0];
for (<*>) {
my $oldname = $_;
rename($oldname, $_) if s/\?//g;
procdir($_) if -d;
}
chdir "..";
}
procdir("top_directory");

Replace all files of a certain type in all directories and subdirectories?

I have played around with the find command and anything else I can think of but nothing will work.
I would like my bash script to be able to find all of a file type in a given directory and all of its subdirectories and replace the file with another.
EX: lets say
/home/test1/randomfolder/index.html
/home/test1/randomfolder/stuff.html
/home/different/stuff/index.html
/home/different/stuff/another.html
Each of those .html files need to be found when the program is given /home/ as a directory to search in, and then replaced by echoing the other file into them.
Is this possible in bash?
This should more or less get you going in the right direction:
find . -type f -name '*.html' -exec sh -c 'for file do echo "new content" > "$file"; done' sh {} +
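Since the question asks for each file to be replaced with the contents of another file rather than a fixed string, here is a small sketch of that variant. The function name replace_html and its argument order are assumptions made for illustration.

```shell
# replace_html SRC DIR  (hypothetical helper)
# Overwrites the contents of every *.html file under DIR with those of SRC.
replace_html() {
    find "$2" -type f -name '*.html' -exec sh -c '
        src=$1; shift
        for f do
            cat "$src" > "$f"    # rewrite the contents in place, keep the file itself
        done' sh "$1" {} +
}
```

It would be called as replace_html /path/to/new.html /home/. The source file should live outside the searched directory, or have a different extension, otherwise it would be truncated along with the others.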

symlink-copying a directory hierarchy

What's the simplest way on Linux to "copy" a directory hierarchy so that a new hierarchy of directories are created while all "files" are just symlinks pointing back to the actual files on the source hierarchy?
cp -s does not work recursively.
I just did a quick test on a linux box and cp -sR /orig /dest does exactly what you described: creates a directory hierarchy with symlinks for non-directories back to the original.
cp -as /root/absolute/path/name dest_dir
will do what you want. Note that the source name must be an absolute path, it cannot be relative. Else, you'll get this error: "xyz-file: can make relative symbolic links only in current directory."
Also, be careful as to what you're copying: if dest_dir already exists, you'll have to do something like:
cp -as /root/absolute/path/name/* dest_dir/
cp -as /root/absolute/path/name/.* dest_dir/
Starting from above the original & new directories, I think this pair of find(1) commands will do what you need:
find original -type d -exec mkdir -p new/{} \;
find original -type f -exec ln -s "$PWD"/{} new/{} \;
The first instance sets up the directory structure by finding only directories in the original tree and recreating them in the new tree. The second creates the symlinks to the original files in the new tree.
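The two steps can be wrapped in one function that resolves both paths to absolute ones first, so the links do not depend on where it is run from. This is only a sketch: the name link_tree is made up, and it assumes the destination is not located inside the source tree.

```shell
# link_tree SRC DEST  (hypothetical helper)
# Recreates SRC's directory structure under DEST and symlinks every regular file.
link_tree() (
    src=$(cd "$1" && pwd) || exit 1      # absolute path of the source
    mkdir -p "$2" || exit 1
    dest=$(cd "$2" && pwd) || exit 1     # absolute path of the destination
    cd "$src" || exit 1
    # Step 1: directories only.
    find . -type d -exec sh -c '
        dest=$1; shift
        for d do mkdir -p "$dest/$d"; done' sh "$dest" {} +
    # Step 2: one absolute symlink per regular file.
    find . -type f -exec sh -c '
        src=$1; dest=$2; shift 2
        for f do ln -s "$src/${f#./}" "$dest/$f"; done' sh "$src" "$dest" {} +
)
```

The function body runs in a subshell (note the parentheses), so the cd does not affect the calling shell. Unlike the pair of commands above, it places the contents of SRC directly under DEST rather than under DEST/SRC.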
There's also the "lndir" utility (from X) which does such a thing; I found it mentioned in Debian Bug report #301030 ("can we move lndir to coreutils or debianutils?"), and I'm now happily using it.
I googled around a little bit and found a command called lns, available from here.
If you feel like getting your hands dirty
Here is a trick that will automatically create the destination folder, subfolders and symlink all files recursively.
In the folder where the files you want to symlink and sub folders are:
create a file shell.sh:
nano shell.sh
copy and paste this charmer:
#!/bin/bash
export DESTINATION=/your/destination/folder/
export TARGET=/your/target/folder/
find . -type d -print0 | xargs -0 bash -c 'for DIR in "$@";
do
echo "${DESTINATION}${DIR}"
mkdir -p "${DESTINATION}${DIR}"
done' -
find . -type f -print0 | xargs -0 bash -c 'for file in "$@";
do
ln -s "${TARGET}${file}" "${DESTINATION}${file}"
done' -
save the file ctrl+O
close the file ctrl+X
Make your script executable: chmod +x shell.sh
Run your script ./shell.sh
Happy hacking!
I know the question was regarding shell, but since you can call perl from shell, I wrote a tool to do something very similar to this, and posted it on perlmonks a few years ago. In my case, I generally wanted directories to remain links until I decide otherwise. It'd be a fairly trivial change to do this automatically and recursively.

Text specification for a tree of files?

I'm looking for examples of specifying files in a tree structure, for example, for specifying the set of files to search in a grep tool. I'd like to be able to include and exclude files and directories by name matches. I'm sure there are examples out there, but I'm having a hard time finding them.
Here's an example of a possible syntax:
*.py *.html
*.txt *.js
-*.pyc
-.svn/
-*combo_*.js
(this would mean include files with extensions .py .html .txt .js, exclude .pyc files, anything under a .svn directory, and any file matching *combo_*.js)
I know I've seen these sorts of specifications in other tools before. Is this ringing any bells for anyone?
There is no single standard format for this kind of thing, but if you want to copy something that is widely recognized, have a look at the rsync documentation. Look at the chapter on "INCLUDE/EXCLUDE PATTERN RULES."
Apache Ant provides "Ant globs" or patterns, where:
**/foo/**/*.java
means "any file ending in '.java' in a directory which includes a directory named 'foo' in its path" -- including ./foo/X.java
In your example syntax, is it implicitly understood that there's an escaping character so that you can explicitly include a file that begins with a dash? (The same question goes for any other wildcard characters, but I suppose I'd expect to see more files with dashes in their names than asterisks.)
Various command shells use * (and possibly ? to match a single char), as in your example, but they generally only match against a string of characters that doesn't include a path component separator (i.e. '\' on Windows systems, '/' elsewhere). I've also seen such source control apps as Perforce use additional patterns that can match against path component separators. For instance, with Perforce the pattern "foo/...ext" (without quotes) will match all files under the foo/ directory structure that end with "ext", regardless of whether they are in foo/ itself or in one of its descendant directories. This seems to be a useful pattern.
If you're using bash, you can use the extglob extension to get some nice globbing functions. Enable it as follows:
shopt -s extglob
Then you can do things like the following:
# everything but .html, .jpg or .gif files
ls -d !(*.html|*gif|*jpg)
# list file9, file22 but not fileit
ls file+([0-9])
# begins with apl or un only
ls -d +(apl*|un*)
See also this page.
How about find in unixish environments?
Find can, of course, do more than build a list of files, but that is one of the common ways it is used. From the man page:
NAME
find -- walk a file hierarchy
SYNOPSIS
find [-H | -L | -P] [-EXdsx] [-f pathname] pathname ... expression
find [-H | -L | -P] [-EXdsx] -f pathname [pathname ...] expression
DESCRIPTION
The find utility recursively descends the directory tree for each
pathname listed, evaluating an expression (composed of the
``primaries'' and ``operands'' listed below) in terms of each file
in the tree.
To achieve your goal I would write something like (formatted for readability):
find . -type d -name .svn -prune -o \
-type f \( -name '*.py' -o -name '*.html' -o -name '*.txt' -o -name '*.js' \) \
! -name '*combo_*.js' \
-print
Moreover, there is an idiomatic pattern using xargs which makes find suitable for sending the whole list so constructed to an arbitrary command, as in:
find /path -type f -print0 | xargs -0 rm
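Building on the find and xargs pattern, the include/exclude list from the question could be fed to grep roughly as follows. This is a sketch under stated assumptions: the function name grep_tree is hypothetical, the patterns are hard-coded from the question's example, and -print0 with xargs -0 is assumed to be available (it is in GNU and BSD tools, but not required by POSIX).

```shell
# grep_tree PATTERN DIR  (hypothetical helper)
# Lists files under DIR that contain PATTERN, limited to .py/.html/.txt/.js
# files, skipping .svn directories and *combo_*.js files.
grep_tree() {
    find "$2" -type d -name .svn -prune -o \
        -type f \( -name '*.py' -o -name '*.html' -o -name '*.txt' -o -name '*.js' \) \
        ! -name '*combo_*.js' -print0 |
        xargs -0 grep -l -e "$1" --
}
```

The -prune keeps find from descending into .svn at all, and .pyc files are excluded simply because they do not match any of the included extensions.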
find(1) is a fine tool, as described in the previous answer, but if things get more complicated you should consider either writing your own script in any of the usual suspects (Ruby, Perl, Python et al.) or using one of the more powerful shells such as zsh, which supports recursive ** globbing and lets you specify things to exclude. The latter is probably more complicated, though.
You might want to check out ack, which allows you to specify file types to search in with options like --perl, etc.
It also ignores .svn directories by default, as well as core dumps, editor cruft, binary files, and so on.

Resources