Text specification for a tree of files?

I'm looking for examples of specifying files in a tree structure, for example, for specifying the set of files to search in a grep tool. I'd like to be able to include and exclude files and directories by name matches. I'm sure there are examples out there, but I'm having a hard time finding them.
Here's an example of a possible syntax:
*.py *.html
*.txt *.js
-*.pyc
-.svn/
-*combo_*.js
(This would mean: include files with extensions .py, .html, .txt, and .js; exclude .pyc files, anything under a .svn directory, and any file matching *combo_*.js.)
I know I've seen these sorts of specifications in other tools before. Is this ringing any bells for anyone?

There is no single standard format for this kind of thing, but if you want to copy something that is widely recognized, have a look at the rsync documentation, specifically the section on "INCLUDE/EXCLUDE PATTERN RULES."
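For instance, the example spec above maps fairly directly onto rsync's command-line filters (a sketch; rsync applies the first matching rule, so the exclusions and the directory include must come before the final catch-all exclude):
rsync -a \
  --exclude='.svn/' --exclude='*.pyc' --exclude='*combo_*.js' \
  --include='*/' --include='*.py' --include='*.html' \
  --include='*.txt' --include='*.js' \
  --exclude='*' \
  src/ dst/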

Apache Ant provides "Ant globs", or patterns, where:
**/foo/**/*.java
means "any file ending in '.java' in a directory which includes a directory named 'foo' in its path" -- including ./foo/X.java

In your example syntax, is it implicitly understood that there's an escaping character so that you can explicitly include a file that begins with a dash? (The same question goes for any other wildcard characters, but I suppose I'd expect to see more files with dashes in their names than asterisks.)
Various command shells use * (and possibly ? to match a single character), as in your example, but they generally only match a string of characters that doesn't include a path component separator (i.e. '\' on Windows systems, '/' elsewhere). I've also seen source control tools such as Perforce use additional patterns that can match across path component separators. For instance, with Perforce the pattern "foo/...ext" (without quotes) will match all files under the foo/ directory structure that end with "ext", whether they are in foo/ itself or in one of its descendant directories. This seems to be a useful pattern.
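In shell terms, a rough equivalent of that Perforce pattern would be (a sketch):
find foo -type f -name '*ext'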

If you're using bash, you can use the extglob extension to get some nice globbing functions. Enable it as follows:
shopt -s extglob
Then you can do things like the following:
# everything but .html, .gif or .jpg files
ls -d !(*.html|*.gif|*.jpg)
# list file9, file22 but not fileit
ls file+([0-9])
# begins with apl or un only
ls -d +(apl*|un*)
See also this page.
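Applied to the original question's spec (minus the recursive .svn exclusion, since plain globs don't cross directory boundaries), something like this could work:
shopt -s extglob
ls -d *.@(py|html|txt) !(*combo_*).js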

How about find in unixish environments?
Find can, of course, do more than build a list of files, but that is one of the common ways it is used. From the man page:
NAME
find -- walk a file hierarchy
SYNOPSIS
find [-H | -L | -P] [-EXdsx] [-f pathname] pathname ... expression
find [-H | -L | -P] [-EXdsx] -f pathname [pathname ...] expression
DESCRIPTION
The find utility recursively descends the directory tree for each
pathname listed, evaluating an expression (composed of the
``primaries'' and ``operands'' listed below) in terms of each file in
the tree.
To achieve your goal I would write something like (formatted for readability):
find . \( -name .svn -type d -prune \) -o \
     \( \( -name '*.py' -o -name '*.html' -o -name '*.txt' -o -name '*.js' \) \
        ! -name '*combo_*.js' -print \)
Moreover, there is an idiomatic pattern using xargs which makes find suitable for sending the whole constructed list to an arbitrary command, as in:
find /path -type f -print0 | xargs -0 rm

find(1) is a fine tool, as described in the previous answer, but if it gets more complicated, you should consider either writing your own script in any of the usual suspects (Ruby, Perl, Python et al.) or trying one of the more powerful shells such as zsh, which has ** recursive globbing and lets you specify patterns to exclude. The latter is probably more complicated, though.
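For example, zsh with the EXTENDED_GLOB option can express most of the original spec in one pattern (a sketch; the x~y operator excludes matches of y, and ** skips dot directories such as .svn unless GLOB_DOTS is set):
setopt extended_glob
print -l **/*.(py|html|txt|js)~*combo_*.js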

You might want to check out ack, which allows you to specify file types to search in with options like --perl, etc.
It also ignores .svn directories by default, as well as core dumps, editor cruft, binary files, and so on.
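For instance (a sketch; the pattern and the extra ignored directory here are made up):
ack --python --ignore-dir=build 'some_pattern'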

Related

How to cat similar named sequence files from different directories into single large fasta file

I am trying to get the following done. I have circa 40 directories of different species, each with 100s of sequence files that contain orthologous sequences. The sequence files are similarly named for each of the species directories. I want to concatenate the identically named files of the 40 species directories into a single sequence file which is named similarly.
My data looks as follows, e.g.:
directories: Species1 Species2 Species3
Within directory (similar for all): sequenceA.fasta sequenceB.fasta sequenceC.fasta
I want to get single files named: sequenceA.fasta sequenceB.fasta sequenceC.fasta
where the content of the different files from the different species is concatenated.
I tried to solve this with a loop (but this never ends well with me!):
ls . | while read FILE; do cat ./*/"$FILE" >> ./final/"$FILE"; done
This resulted in empty files and errors. I did try to find a solution elsewhere, e.g.: (https://www.unix.com/unix-for-dummies-questions-and-answers/249952-cat-multiple-files-according-file-name.html, https://unix.stackexchange.com/questions/424204/how-to-combine-multiple-files-with-similar-names-in-different-folders-by-using-u) but I have been unable to edit them to my case.
Could anyone give me some help here? Thanks!
In a root directory where your species directories reside, you should run the following:
$ mkdir output
$ find Species* -type f -name "*.fasta" -exec sh -c 'cat {} >> output/`basename {}`' \;
It traverses all the files recursively and merges the contents of files with identical basenames into one file under the output directory.
EDIT: even though this was an accepted answer, in a comment the OP mentioned that the real directories don't match a common pattern Species* as shown in the original question. In this case you can use this:
$ find -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat {} >> output/`basename {}`' \;
This way, we don't specify the search pattern but rather explicitly omit output directory to avoid duplicates of already processed data.
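As an aside, substituting {} directly inside the sh -c string is fragile with unusual filenames; a slightly safer variant of the same command passes each name as a positional argument instead (a sketch):
find . -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' _ {} \;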

Error when trying to recursively add file extension to all files

Referring to this post, recursively add file extension to all files, I am trying to recursively add extensions to many files within many separate subfolders. All of the files appearing at the end of my subfolders do not have any extension at all, and I would like to give them all a .html extension.
I have tried the following in my command prompt after using cd to change to the parent directory that I would like to use:
find /path -type f -not -name "*.*" -exec mv "{}" "{}".html \;
However, I receive the following error: "FIND: Invalid switch"
I am new to using the command prompt for this type of manipulation, so please excuse my ignorance. I am thinking that maybe I have to change the /path to the directory I want it to look through, but I tried that to no avail.
I have also tried the following command:
find . -type f -exec mv '{}' '{}'.html \;
and receive the following error: FIND: Parameter format not correct
I am running Windows 10.
Seems like -not isn't available in your find version; use ! instead:
find /path -type f \! -name "*.*" -exec mv "{}" "{}".html \;
From find manual:
-not expr
Same as ! expr, but not POSIX compliant.
-not is a GNU extension for logical negation; the portable form is ! (exclamation mark), which is why the man page excerpt above flags -not as non-POSIX. Either way, the negation applies to the expression that follows it, such as -name "*.*", as in the corrected command above.
I see another strong indicator: what is FIND? The command you supposedly ran is find; UNIX is case-sensitive, and "FIND: Invalid switch" and "FIND: Parameter format not correct" are error messages from the Windows FIND.EXE text-search tool, not from a Unix find. At whatever command line you're using, type man find or find --help to get a list of options and semantics. I'm worried that the bash you have under Windows isn't full-featured.
Are you familiar with the Windows command ren (rename)? It has a syntax similar to the UNIX mv, although it will work on multiple files. For instance, since the cmd wildcard pattern *. matches names with no extension,
ren *. *.html
I think would work for a single directory.
Apologies to all who commented and left answers. I believe I was unclear that I was trying to use this specifically from the windows cmd prompt. I used the following to add extensions to all files at the end of my subfolders:
FOR /R %f IN (*.) DO REN "%f" *.html

Moving things in terminal based on their name

Edit: I think this has been answered successfully, but I can't check 'til later. I've reformatted it as suggested though.
The question: I have a series of files, each with a name of the form XXXXNAME, where XXXX is some number. I want to move them all to separate folders called XXXX and have them called NAME. I can do this manually, but I was hoping that by naming them XXXXNAME there'd be some way I could tell Terminal (I think that's the right name, but not really sure) to move them there. Something like
mv *NAME */NAME
but where it takes whatever * was in the first case and regurgitates it to the path.
This is on some form of Linux, with a bash shell.
In the real life case, the files are 0000GNUmakefile, with sequential numbering. I'm having to make lots of similar-but-slightly-altered versions of a program to compile and run on a cluster as part of my research. It would probably have been quicker to write a program to edit all the files and put in the right place in the first place, but I didn't.
This is probably extremely simple, and I should be able to find an answer myself, if I knew the right words. Thing is, I have no formal training in programming, so I don't know what to call things to search for them. So hopefully this will result in me getting an answer, and maybe knowing how to find out the answer for similar things myself next time. With the basic programming I've picked up, I'm sure I could write a program to do this for me, but I'm hoping there's a simple way to do it just using functionality already in Terminal. I probably shouldn't be allowed to play with these things.
Thanks for any help! I can actually program in C and Python a fair amount, but that's through trial and error largely, and I still don't know what I can do and can't do in Terminal.
SO many ways to achieve this.
I find that the old standbys sed and awk are often the most powerful.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p'
If you're satisfied that the commands look right, pipe the command line through a shell:
ls | sed -rne 's:^([0-9]{4})(NAME)$:mv -iv & \1/\2:p' | sh
I put NAME in parentheses and used \2 so that if it varies more than your example indicates, you can come up with a regular expression that handles your filenames better.
To do the same thing in gawk (GNU awk, the variant found in most GNU/Linux distros):
ls | gawk '/^[0-9]{4}NAME$/ {printf("mv -iv %s %s/%s\n", $0, substr($0,1,4), substr($0,5))}'
As with the first sample, this produces commands which, if they make sense to you, can be piped through a shell by appending | sh to the end of the line.
Note that with all these mv commands, I've added the -i and -v options. This is for your protection. Read the man page for mv (by typing man mv in your Linux terminal) to see if you should be comfortable leaving them out.
Also, I'm assuming with these lines that all your directories already exist. You didn't mention if they do. If they don't, here's a one-liner to create the directories.
ls | sed -rne 's:^([0-9]{4})(NAME)$:mkdir -p \1:p' | sort -u
As with the others, append | sh to run the commands.
I should mention that it is generally recommended to use constructs like for (in Tim's answer) or find instead of parsing the output of ls. That said, when your filename format is as simple as /[0-9]{4}word/, I find the quick sed one-liner to be the way to go.
Lastly, if by NAME you actually mean "any string of characters" rather than the literal string "NAME", then in all my examples above, replace NAME with .*.
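Following up on the note about for and find above, a find-based version of the same move-and-rename that avoids parsing ls might look like this (a sketch, assuming GNU or BSD find and the literal NAME suffix):
find . -maxdepth 1 -name '[0-9][0-9][0-9][0-9]NAME' -exec sh -c '
  f=${1#./}; d=${f%NAME}
  mkdir -p "$d" && mv -iv "$f" "$d/NAME"
' _ {} \;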
The following script will do this for you. Copy the script into a file on the remote machine (we'll call it sortfiles.sh).
#!/bin/bash
# Get all files in current directory having names XXXXsomename, where X is an integer
files=$(find . -maxdepth 1 -type f -name '[0-9][0-9][0-9][0-9]*')
# Build a list of the XXXX patterns found in the list of files
dirs=
for name in ${files}; do
dirs="${dirs} $(echo ${name} | cut -c 3-6)"
done
# Remove redundant entries from the list of XXXX patterns
dirs=$(echo ${dirs} | tr ' ' '\n' | sort -u)
# Create any XXXX directories that are not already present
for name in ${dirs}; do
if [[ ! -d ${name} ]]; then
mkdir ${name}
fi
done
# Move each XXXXsomename file into directory XXXX, renaming it to somename
for name in ${files}; do
mv ${name} $(echo ${name} | cut -c 3-6)/$(echo ${name} | cut -c 7-)
done
# Return from script with normal status
exit 0
From the command line, do chmod +x sortfiles.sh
Execute the script with ./sortfiles.sh
Just open the Terminal application, cd into the directory that contains the files you want moved/renamed, and copy and paste these commands into the command line.
shopt -s extglob  # the *(...) patterns below need extended globbing
for file in [0-9][0-9][0-9][0-9]*; do
dirName="${file%%*([^0-9])}"
mkdir -p "$dirName"
mv "$file" "$dirName/${file##*([0-9])}"
done
This assumes all the files that you want to rename and move are in the same directory. The file globbing also assumes that there are at least four digits at the start of the filename. Files with more than four digits will still be caught, but files with fewer than four will not; if yours have fewer, remove the appropriate number of [0-9]s from the first line.
It does not handle the case where "NAME" (i.e. the name of the new file you want) starts with a number.
See this site for more information about string manipulation in bash.

how to find all the labels for a given file in clearcase

I know one awkward solution for this task would be:
first, use ct ls to get the entire version info of the file,
then pipe the version info to a parsing script to actually extract the file's labels.
But I would guess ClearCase has a "built-in" solution for this task that doesn't need support from external scripts.
Please help me if you happen to know a "built-in" solution for the task.
Thanks in advance.
fmt_ccase contains all the format strings for various ClearCase elements.
For a version of a file, you can:
cleartool descr -fmt "%l\n" /path/to/a/version
%l
Labels: For versions, all attached labels; the null string otherwise.
Labels are output as a comma-separated list, enclosed in parentheses.
A <SPACE> character follows each comma.
Variants:
%Cl
Max labels: Specify the maximum number of labels to display with the max-field-width parameter (see Specifying field width).
If there are more labels, "..." is appended to the output.
If no max-field-width is specified, the maximum default value is 3.
%Nl
No commas: Suppress the parentheses and commas in label list output;
separate labels with spaces only.
So the result can be:
Labels: (Rel3.1C, Rel3.1D, Rel3.1E)
Labels without commas or parens: Rel3.1C Rel3.1D Rel3.1E
In both cases, you still need to parse the result, but at least the output can contain only the labels, as in:
Rel3.1C Rel3.1D Rel3.1E
onaclov2000 adds (from the comments):
The only problem with this is that you are grabbing the label on the specific version of the file.
Given that branches etc can exist, we'll need to be able to get ALL labels on a file.
If you use the graphical version tree and select Tools -> "Locate", you can see ALL the labels attached to that file.
Is there a common command in cleartool that will return the results of "locate", or "contents"?
The graphical version tree (ct lsvtree -graphical) does display the labels of all the versions of the element currently seen by the view when you click "Label Name".
That being said, there does not seem to be a "built-in" solution and some parsing is involved:
For instance (which is a bit shorter than the OP version but still based on a cleartool ls):
ct ls -l addon.xml@@|grep version|gawk "{gsub(/^version.*@@\\\\/,\"\",$0); gsub(/ \[.*/,\"\",$0); print $0}"
(GnuWin32 syntax)
or, only with a dynamic view:
cd m:/myView/path/to/addon.xml@@
# list all files, not directories: the files are the labels
dir /B /A-D
The IBM article "Additional examples of the cleartool find command" is a great source for find query.
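Another scriptable option is to combine cleartool find with the %l family of format strings from fmt_ccase shown earlier (a sketch; it prints the extended name and labels of every version except /main/0):
cleartool find addon.xml -version '!version(/main/0)' -exec 'cleartool descr -fmt "%Xn %Nl\n" "$CLEARCASE_XPN"'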
To expand on the "lsvtree" bit mentioned by VonC in his answer, you have:
To find all elements with any label:
Windows:
cleartool find . -type f -exec "cleartool lsvtree -a %CLEARCASE_PN%" | findstr
"("
./hello.c##/main/1 (LABEL100, LABEL99, LABEL98, LABEL97)
./foo.xml##/main/BR1/1 (REL2)
./bar.o##/main/1 (REL1)
UNIX/Linux:
cleartool find . -type f -exec 'cleartool lsvtree -a $CLEARCASE_PN' | grep "("
./hello.c@@/main/1 (LABEL100, LABEL99, LABEL98, LABEL97)
./foo.xml@@/main/BR1/1 (REL2)
./bar.o@@/main/1 (REL1)
That finds only labels for versions currently selected in the view, but you could reuse the lsvtree part to grep all versions of a file with labels.

In ClearCase, how can I view old version of a file in a static view, from the command line?

In a static view, how can I view an old version of a file?
Given an empty file (called empty in this example) I can subvert diff to show me the old version:
% cleartool diff -ser empty File@@/main/28
This feels like a pretty ugly hack. Have I missed a more basic command? Is there a neater way to do this?
(I don't want to edit the config spec - that's pretty tedious, and I'm trying to look at a bunch of old versions.)
Clarification: I want to send the version of the file to stdout, so I can use it with the rest of Unix (grep, sed, and so on.) If you found this question because you're looking for a way to save a version of an element to a file, see Brian's answer.
I'm trying to look at a bunch of old versions
I am not sure if you are speaking about "a bunch of old versions" of one file, or "a bunch of old versions" from several files.
To visualize several old versions of one file, the simplest way is to display its version tree (ct lsvtree -graph File), then select a version, right-click on it and 'Send To' an editor which accepts multiple files (like Notepad++). In a few clicks you will have a view of those old versions.
Note: you must have CC6.0 or 7.0.1 IFix01 (7.0.0 and 7.0.1 fail to 'send to' a file with the following error message: "Access to unnamed file was denied")
But to visualize several old versions of different files, I would recommend a dynamic view and editing the config spec of that view (and not the snapshot view you are currently working with), in order to quickly select all those old files (hopefully through a simple select rule like 'element * aLabel')
[From the comments:]
what's the idiomatic way to "cat" an earlier revision of a file?
The idiomatic way is through a dynamic view (that you configure with the exact same config spec as your existing snapshot view).
You can then browse (as in 'change directory to') the various extended paths of a file.
If you want to cat all versions of a branch of a file, you go in:
cd /view/MyView/vobs/myVobs/myPath/myFile@@/main/[...]/myBranch
cat 1
cat 2
...
cat x
'1', '2', ... 'x' being versions 1, 2, ... x of your file within that branch.
For a snapshot view, the extended path is not accessible, so your "hack" is the way to go.
However, two remarks here:
to quickly display all previous revisions of a snapshot file in a given branch, you can type:
(one line version for copy-paste, Unix syntax:)
cleartool find addon.xml -ver 'brtype(aBranch) && !version(.../aBranch/LATEST) && ! version(.../aBranch/0)' -exec 'cleartool diff -ser empty "$CLEARCASE_XPN"'
(multi-line version for readability:)
cleartool find addon.xml -ver 'brtype(aBranch) &&
!version(.../aBranch/LATEST) &&
! version(.../aBranch/0)'
-exec 'cleartool diff -ser empty "$CLEARCASE_XPN"'
you can quickly get slightly nicer output with
(one line version for copy-paste, Unix syntax:)
cleartool find addon.xml -ver 'brtype(aBranch) && !version(.../aBranch/LATEST) && ! version(.../aBranch/0)' -exec 'cleartool diff -ser empty "$CLEARCASE_XPN"' | ccperl -nle '$a=$_; $b = $a; $b =~ s/^>+\s(?:file\s+\d+:\s+)?//g;print $b if $a =~/^>/'
(multi-line version for readability:)
cleartool find addon.xml -ver 'brtype(aBranch) &&
!version(.../aBranch/LATEST) &&
! version(.../aBranch/0)'
-exec 'cleartool diff -ser empty "$CLEARCASE_XPN"'
| ccperl -nle '$a=$_; $b = $a;
$b =~ s/^>+\s(?:file\s+\d+:\s+)?//g;
print $b if $a =~/^>/'
That way, the output is nicer.
The "cleartool get" command (man page) mentioned below by Brian don't do stdout:
The get command copies only file elements into a view.
On a UNIX or Linux system, copy /dev/hello_world/foo.c@@/main/2 into the current directory.
cmd-context get -to foo.c.temp /dev/hello_world/foo.c@@/main/2
On a Windows system, copy \dev\hello_world\foo.c@@\main\2 into the C:\build directory.
cmd-context get -to C:\build\foo.c.temp \dev\hello_world\foo.c@@\main\2
So maybe then, by chaining the get with a type (or cat on Unix), you can do something with the content of the saved copy:
cmd-context get -to C:\build\foo.c.temp \dev\hello_world\foo.c@@\main\2 && type C:\build\foo.c.temp
I know this is an old thread...but I couldn't let this thrashing go by unresolved....
Static views have a "ct get" command that does exactly what you are looking for.
cleartool get -to ~/foo File@@/main/28
will save this version of the file in ~/foo.
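If the goal is still stdout (to feed grep, sed, and so on), a tiny wrapper around get does the trick (a sketch; assumes a Unix shell with mktemp):
tmpdir=$(mktemp -d)                       # fresh scratch dir, so the target path is unused
cleartool get -to "$tmpdir/v28" 'File@@/main/28'
cat "$tmpdir/v28"                         # pipe this into grep, sed, etc.
rm -rf "$tmpdir"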
[ Rewritten based on the first comment ]
All files in Clearcase, including versions, are available in the virtual directory structure. I don't have a lot of familiarity with static views, but I believe they still go through a virtual fs; they just get updated differently.
In that case, you can just do:
cat File@@/main/28
It can get ugly if you also have to find the right version of a directory that contained that file element. We have a Perl script at work that uses this approach to analyze historical changes made to files, and we quickly ran out of command-line space on Windows to actually run the commands!
If File is a Clearcase element, and cat File works, and the view is set correctly, then try:
cat File@@/main/28
(note: without the ct shell-- you shouldn't need this if you're already in the view.)
Try typing:
ct ls -l File
If it shows the file with an extended name similar to the above, then you should be able to cat the file using an extended name.
ct shell cat File@@version
