Perl: Rename all files in a directory - arrays

So I am brand new to Perl and this is my first program (excluding a few basic tutorials to get to grips with very basic syntax)
What I want to do is rename all files within a specified directory to "File 1", "File 2", "File 3" etc
This is the code I have got so far:
use 5.16.3;
use strict;
print "Enter Directory: ";
my $directoryPath = <>;
chdir('$directoryPath') or die "Cant chdir to $directoryPath$!";
@files = readdir(DIR); #Array of file names
closedir(DIR);
my $i = 1; #counting integer for file names
my $j = 0; #counting integer for array values
my $fileName = File;
for (@files)
{
rename (@files[j], $fileName + i) or die "Cant rename file @files[j]$!";
i++;
j++;
}
chdir; #return to home directory
I have a number of issues:
1: Whenever I try to change directory I get the 'or die' message. I am wondering if this is to do with the working directory I start from, do I need to go up to the C: directory by doing something like '..\' before traversing down through a different directory path?
2: Error message 'Bareword "File" not allowed while "strict subs" in use'
3: Same as point 2. but for "i" and for "j"
4: Error message 'Global symbol "@files" requires explicit package name'
Note: I can obviously only get error one if I comment out everything after line else the program won't compile.

1: No. You probably need to chomp($directoryPath) first to remove the newline, though. And remove the single quotes, since they do not allow interpolation. You never need to quote a single variable like that.
2: File should be "File". Otherwise it is a bareword.
3: j should be $j, and the same goes for i.
4: You must declare the array with my @files, just like you did with the other variables.
You should also know that + is not a concatenation operator in Perl. You should use . for that purpose. But you can also just interpolate it in a double quoted string. When referring to a single array element, you should also use the scalar sigil $, and not @:
rename($files[$j], "$fileName$i") or die ...
You have also forgotten to opendir before you readdir.
You are using a for loop, but not using the iterator value $_, instead using your own counter. You are using two counters where only one is needed. So you might as well do:
for my $i (0 .. $#files) { # $#files is the last index of @files
rename($files[$i], $fileName . ($i+1)) or die ...
}
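Putting the corrections above together, a minimal sketch of the whole task might look like the following. The helper name rename_sequentially, the sorting of the file list, and the use of File::Temp to build a throwaway directory are assumptions added for illustration; they are not part of the original question or answer. The sketch does not guard against a file that is already called "File 2", for example, so treat it as a starting point rather than a finished tool.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Rename every plain file in $dir to "File 1", "File 2", ... and
# return the new names. Hypothetical helper, not from the question.
sub rename_sequentially {
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "Can't opendir $dir: $!";
    # Skip '.', '..' and subdirectories; sort for a stable order.
    my @files = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    my @renamed;
    for my $i (0 .. $#files) {
        my $new = "File " . ($i + 1);
        rename("$dir/$files[$i]", "$dir/$new")
            or die "Can't rename $files[$i]: $!";
        push @renamed, $new;
    }
    return @renamed;
}

# Demonstration on a temporary directory so nothing real is touched.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt c.txt)) {
    open(my $fh, '>', "$dir/$name") or die "Can't create $name: $!";
    close $fh;
}
my @renamed = rename_sequentially($dir);
print "$_\n" for @renamed;
```

Using full paths ("$dir/$name") avoids the chdir step entirely, which also sidesteps the first issue in the question.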

Change chdir('$directoryPath') to chdir("$directoryPath") (double quotes, so that the variable is interpolated) and chomp $directoryPath beforehand
File -> "File"
i -> $i
Declare my @files

Related

Removing file extension from an array variable

I'm trying to remove the .png file extension that appears in many (but not all) of the variables of my outputted array. The array variables that show the extension are doing so because they weren't generated from file names in the format of "Genus_species#.png" where "#" is a number. Rather, they were generated from an un-numbered file name in the format of "Genus_species.png". I believe this line of code is creating this issue: "$genus = $file =~ s/\d.png$//r;". How do I resolve this? Please advise.
Here's my Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
next unless $file =~ /.png$/; ## entry must have .png extension
my $genus = $file =~ s/\d\.png$//r;
push(@{$genus_species{$genus}}, $file); ## push to array; the @{} dereferences the hash entry as an array
}
for my $genus (keys %genus_species)
{
print "$genus = ";
print "$_, " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
print "\n";
}
Here's the outputted array:
Euonymus_fortunei = Euonymus_fortunei1.png, Euonymus_fortunei2.png, Euonymus_fortunei3.png,
Polygonum_persicaria = Polygonum_persicaria1.png, Polygonum_persicaria2.png,
Polygonum_cuspidatum.png = Polygonum_cuspidatum.png,
Notice that the variable "Polygonum_cuspidatum.png" unwantingly includes the file extension because this variable was generated from a file that lacked a number in its name. Specifically, this variable should read:
Polygonum_cuspidatum = Polygonum_cuspidatum.png
Again, please advise how to resolve this issue. Thanks.
You're going to see the same issue if you ever have a multi-digit number in a filename. This is all due to the choice of regular expression:
s/\d\.png$//r
This looks for exactly one digit followed by .png. If you want no digit, or any number of digits before .png modify your regular expression as such:
s/\d*\.png$//r
That says "zero or more digits followed by .png at the end of the string".
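As a quick check of the suggested pattern, here is a small sketch that applies it to three file names. The names in @names are made-up sample data (one single-digit, one un-numbered, one multi-digit), not files from the question.

```perl
use strict;
use warnings;

# Sample names: single digit, no digit, and two digits before ".png".
my @names = qw(
    Euonymus_fortunei1.png
    Polygonum_cuspidatum.png
    Ilex_verticillata12.png
);

# The /r flag returns the modified copy and leaves the original untouched
# (available from Perl 5.14).
my @stripped = map { s/\d*\.png$//r } @names;

print "$_\n" for @stripped;
# prints:
#   Euonymus_fortunei
#   Polygonum_cuspidatum
#   Ilex_verticillata
```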

Creating array of file names using grep

I'm having difficulty outputting file names as an array using grep. Specifically, I want to create an array of file names (plant photos) formatted like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Here's my original Perl script that I'm attempting to modify. It returns, as intended, ONE file name per plant as "Genus_species", printing a list of those plants:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
map { s/1\.png\z/.png/r } # Removes "1" from end of file names
grep { /^[^2-9]*\.png\z/i && /_/ } # Finds "Genus_species.png" & "Genus_species1.png" and returns one file name per plant as "Genus_species.png"
readdir $dfh;
foreach my $file (@files) {
$file =~s/\.png//; # Removes ".png" extension
print "$file\n"; #Prints list of file names (plant names)
}
Here's the output:
Ilex_verticillata
Asarum_canadense
Ageratina_altissima
However, since each plant often has MULTIPLE photos (e.g.-- "Genus_species1.png, Genus_species2.png, etc.), I need to re-grep the directory using the above output to find their file names, then output the results in the form of an array as previously illustrated.
I know the solution likely involves modifying the "foreach" statement, using grep to return ALL file names with "Genus_species" in their name. Here's what I tried:
foreach my $file (@files) {
$file =~s/\.png//;
grep ($file,readdir(DIR));
print "$file = $file\n";
But the output was this:
Ilex_verticillata = Ilex_verticillata
Asarum_canadense = Asarum_canadense
Ageratina_altissima = Ageratina_altissima
Again, I want to output an array formatted as:
"Genus_species= Genus_species1.png, Genus_species2.png, etc.," meaning I want it to look like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Notice that I also want to add back the ".png" extension ONLY to the file names to the right of the equals sign.
Please advise. Thanks.
Readdir returns a list of files in the folder. You've put them on one line, which is compact. However, if you loop them you can process the items further.
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\d\.png$/; ## skip entry if not a png file ending with a number
    my $genus = $file =~ s/\d\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## push to array; the @{} dereferences the hash entry as an array
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
    print "\n";
}
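The grouping step can be tried without touching a real directory. The sketch below feeds a hard-coded list (hypothetical sample data standing in for the readdir result) through the same hash-of-arrays idea. Two details differ from the answer and are my own choices: it uses \d* so that un-numbered names are grouped too, and it uses join to avoid a trailing separator.

```perl
use strict;
use warnings;

# Stand-in for the list that readdir would return (sample data).
my @listing = qw(
    Ilex_verticillata2.png Asarum_canadense1.png . ..
    Ilex_verticillata1.png Asarum_canadense2.png notes.txt
);

my %genus_species;    # "Genus_species" => [ file names ]
for my $file (@listing) {
    next unless $file =~ /\.png$/;           # ignore non-png entries
    my $genus = $file =~ s/\d*\.png$//r;     # strip digits and extension
    push @{ $genus_species{$genus} }, $file; # array is created on first push
}

for my $genus (sort keys %genus_species) {
    print "$genus = ", join(', ', sort @{ $genus_species{$genus} }), "\n";
}
# prints:
#   Asarum_canadense = Asarum_canadense1.png, Asarum_canadense2.png
#   Ilex_verticillata = Ilex_verticillata1.png, Ilex_verticillata2.png
```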

Regexp to Compare partial filenames then moving to another directory perl

I am working on a script to compare non-running files within a dir to running files from a command. I have to use Regex to strip the front half of the filenames from the dir then regex to strip the filenames out of a command which then records the unmatched names into an array.
The part I cannot figure out is how I can move the filenames from the old dir into a new directory for future deletion.
In order to move the files I will need to enclose them in wildcards, * due to the random numbers in front of the filenames and the extention.
example filenames before and after:
within dir:
13209811124300209156562070_cake_872_trucks.rts
within command:
{"file 872","cake_872_trucks.rts",running}
in @events array:
cake_872_trucks
My code:
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Copy qw(move);
use Data::Dumper;
use List::Util 'max';
my $orig_dir = "/var/user/data/";
my $dest_dir = "/var/user/data/DeleteMe/";
my $dir = "/var/user/data";
opendir(DIR, $dir) or die "Could not open $dir: $!\n";
my @allfiles = readdir DIR;
close DIR;
my %files;
foreach my $allfiles (@allfiles) {
$allfiles =~ m/^(13{2}638752056463{2}635181_|1[0-9]{22}_|1[0-9]{23}_|1[0-9]{24}_|1[0-9]{25}_)([0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z][0-9a-z]{3}_[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z]{2}0[0-9]\.rts|[a-z]{2}[0-9a-z][0-9]\N[0-9a-z]\N[0-9]\N[0-9]\N[0-9a-z]{4}\N[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts|S0{2}2_86F_JATD_01ZF\.rts)$/im;
$files{$2} = [$1];
}
my @stripfiles = keys %files;
my $cmd = "*****";
my @runEvents = `$cmd`;
chomp @runEvents;
foreach my $running(#runEvents) {
$running =~ s/^\{"blah 8[0-9a-z]{2}","(?<field2>CBE1_D{3}1_8EC_J6TG0{2}\.rts|[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts)(?:",\{239,20,93,5\},310{2},20{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{2}0{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{3}0{4},run{2}ing\}|",\{239,20,93,5\},3[0-9]0{2},[0-9]{2}0{4},run{2}ing\})$/$+{field2}/img;
}
my @events = grep {my $x = $_; not grep {$x =~/\Q$_/i} @runEvents} @stripfiles;
foreach my $name (@events) {
my ($randnum, $fnames) = { $files{$name}};
my $combined = $randnum . $fnames;
print "Move $file from $orig_dir to $dest_dir";
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
#print scalar(grep $_, @stripfiles), "\n";
#returned 1626
#print scalar(grep $_, @runEvents), "\n";
#returned 102
#print scalar(grep $_, @allfiles), "\n";
#returned 1906
Once you are parsing filenames with regex there is no reason not to be able to capture all parts so that you can later reconstitute needed parts of the filename.
I assume that that overly long (and incomplete) regex does what it is meant to.
I am not sure how the files to move relate to the original files in @allfiles, since those are fetched from /var/user/data while your moving attempt uses /home/user/RunBackup. So code snippets below are more generic.
If what gets moved are precisely the files from @allfiles then just keep the file name
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/...(...).../; # your regex, but capture the name
$files{$1} = $oldfile;
}
where by /...(...).../ I mean that you use your regex, but add parentheses around the part of the pattern that matches the name itself.
Then you can later retrieve the filename from the "name" of interest (cake_872_trucks).
If, however, the filename components may be needed to patch a different (while related) filename then capture and store the individual components
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/(...)(...)(...)/; # your regex, just with capture groups
$files{$2} = [$1, $3]; # add to %files: name => [number, ext]
}
The regex only matches (why change names in @allfiles with s///?), and captures.
The first set of parentheses captures that long leading factor (number) into $1, the second one gets the name (cake_872_trucks) into $2, and the third one has the extension, in $3.
So you end up with a hash with keys that are names of interest, with their values being arrayrefs with all other needed components of the filename. Please adjust as needed as I don't know what that regex does and may have missed some parts.
Now once you go through @events you can rebuild the name
use feature 'say';
use File::Copy qw(move);
foreach my $name (@events) {
    my ($num, $ext) = @{ $files{$name} };
    my $file = $num . $name . $ext;
    say "Move $file from $orig_dir to $dest_dir";
    move("$orig_dir/$file", $dest_dir) or warn "Can't move $file: $!";
}
But if the files to move are indeed from @allfiles (as would be the case in this example) then use the first version above to store filenames as values in %files and now retrieve them
foreach my $name (@events) {
    move("$orig_dir/$files{$name}", $dest_dir)
        or warn "Can't move $files{$name}: $!";
}
I use the core module File::Copy, instead of going out to the system for the move command.
You can also rebuild the name by going through the directory again, now with names of interest on hand. But that'd be very expensive since you have to try to match every name in @events for every file read in the directory (O(mn) complexity).
What you asked about can be accomplished with glob (and note File::Glob's version)
my @files = glob "$dir/*${name}*";
but you'd have to do this for every $name -- a huge and needless waste of resources.
If that regex really must spell out specific numbers, here is a way to organize it for easier digestion (and debugging!): break it into reasonable parts, with a separate variable for each.
Ideally each part of alternation would be one variable
my $p1 = qr/.../;
my $p2 = qr/.../;
...
my $re_alt = join '|', $p1, $p2, ...;
my $re_other = qr/.../;
$var =~ m/^($re_alt)($re_other)(.*)$/; # adjust anchors, captures, etc
where the qr operator builds a regex pattern.
Adjust those capturing parentheses, anchors, etc to your actual needs. Breaking it up so that the regex is sensibly split into variables will go a long way for readability, and thus correctness.
Assuming that there is a good reason to seek those specific numbers in filenames, this is also a good way to document any such fixed factors.
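To make the outline above concrete, here is a small sketch that builds a pattern from named parts and applies it to the sample filename from the question. All the sub-patterns ($re_num_fixed, $re_num_any, $re_name, $re_ext) are simplified, hypothetical stand-ins; they are not a translation of the original long regex, and the fixed prefix 1999_ is invented purely to show where a documented constant would go.

```perl
use strict;
use warnings;

# Hypothetical, simplified parts; each one can be tested on its own.
my $re_num_fixed = qr/1999_/;                        # a specific, documented prefix (made up)
my $re_num_any   = qr/1[0-9]{22,25}_/;               # "1" followed by 22 to 25 digits
my $re_prefix    = join '|', $re_num_fixed, $re_num_any;
my $re_name      = qr/[a-z]+_8[0-9a-z]{2}_[a-z]+/i;  # e.g. cake_872_trucks
my $re_ext       = qr/\.rts/i;

my $file = '13209811124300209156562070_cake_872_trucks.rts';

# Three capture groups: leading number, name of interest, extension.
my ($num, $name, $ext) = $file =~ /^($re_prefix)($re_name)($re_ext)$/
    or die "No match for $file";

print "number=$num name=$name ext=$ext\n";
# prints: number=13209811124300209156562070_ name=cake_872_trucks ext=.rts
```

Note that flags such as /i are set on each qr part here: a compiled qr pattern keeps its own flags when it is interpolated into a larger regex, so a flag on the outer match would not reach inside it.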
I guess you need something like this:
my $path = '/home/user/RunBackup/';
my @files = map {$path."*$_*"} @events;
system(join " ", "mv", @files, "/home/user/RunBackup/files/");
If there are lots of files you might need to move them one by one:
system(join " ", "mv", $_, "/home/user/RunBackup/files/") for @files;

Simplifying elements of a list/array and then adding incremental identifiers a,b,c,d.... etc to them

I'm processing headers of a .fasta file (which is a file universally used in genetics/bioinformatics to store DNA/RNA sequence data). Fasta files have headers starting with a > symbol (which gives specific info), followed by the actual sequence data on the next line that the header describes. The sequence data extends indefinitely until the next \n after which is followed the next header and its respective sequence. For example:
>scaffold1.1_size947603
ACGCTCGATCGTACCAGACTCAGCATGCATGACTGCATGCATGCATGCATCATCTGACTGATG....
>scaffold2.1_size747567.2.603063_605944
AGCTCTGATCGTCGAAATGCGCGCTCGCTAGCTCGATCGATCGATCGATCGACTCAGACCTCA....
and so on...
So, I have a problem with the fasta headers of the genome for the organism I am working with. Unfortunately the perl expertise needed to solve this problem seems to be beyond my current skill level :S So I was hoping someone on here could show me how it can be done.
My genome consists of about 25000 fasta headers and their respective sequences, the headers in their current state are giving me a lot of trouble with sequence aligners I am trying to use, so I have to simplify them significantly. Here is an example of my first few headers:
>scaffold1.1_size947603
>scaffold10.1_size550551
>scaffold100.1_size305125:1-38034
>scaffold100.1_size305125:38147-38987
>scaffold100.1_size305125:38995-44965
>scaffold100.1_size305125:76102-78738
>scaffold100.1_size305125:84171-87568
>scaffold100.1_size305125:87574-89457
>scaffold100.1_size305125:90495-305068
>scaffold1000.1_size94939
Essentially I would like to refine these to look like this:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a
Or perhaps even this (but this seems like it would be more complicated):
>scaffold1.1
>scaffold10.1
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1
What I'm doing here is getting rid of all the size data for each scaffold of the genome. For scaffolds that happen to be fragmented, I'd like to denote them with a,b,c,d etc. There are a few scaffolds with more than 26 fragments so perhaps I could denote them with x, y, z, A, B, C, D .... etc..
I was thinking to do this with a simple replace foreach loop similar to this:
#!/usr/bin/perl -w
### Open the files
$gen = './Hc_genome/haemonchus_V1.fa';
open(FASTAFILE, $gen);
@lines = <FASTAFILE>;
#print @lines;
###Add an # symbol to the start of the label
my @refined;
foreach my $lines (@lines){
chomp $lines;
$lines =~ s/match everything after .1/replace it with a, b, c.. etc/g;
push @refined, $lines;
}
#print @refined;
###Push the array on to a new fasta file
open FILE3, "> ./Hc_genome/modded_haemonchus_V1.fa" or die "Cannot open output.txt: $!";
foreach (@refined)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
But I don't know how to build in the added alphabetical labels between the $1 and the \n in the match and replace operator. Essentially, I'm not sure how to step sequentially through the alphabet for each fragment of a particular scaffold (all I could manage was to add an a to the start of each one...)
Please if you don't mind, let me know how I might achieve this!
Much appreciated!
Andrew
In Perl, the increment operator ++ has “magical” behaviour with respect to strings. E.g. my $s = "a"; $s++ increments $s to "b". This goes on until "z", where the increment will produce "aa" and so forth.
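A short sketch of that behaviour, using throwaway variable names chosen only for this illustration. As documented in perlop, the magic applies only to variables that have been used purely as strings and match /^[a-zA-Z]*[0-9]*\z/, and the decrement operator is not magical.

```perl
use strict;
use warnings;

my $s  = "a";
my $z  = "z";
my $az = "az";

$s++;     # "a"  becomes "b"
$z++;     # "z"  wraps around and grows to "aa"
$az++;    # "az" carries into the first letter: "ba"

print "$s $z $az\n";    # prints "b aa ba"
```

This means a scaffold with more than 26 fragments simply continues with aa, ab, ac and so on, without any special handling.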
The headers of your file appear to be properly sorted, so we can just loop through each header. From the header, we extract the starting part (everything up to including the .1). If this starting part is the same as the starting part of the previous header, we increment our sequence identifier. Otherwise, we set it to "a":
use strict; use warnings; # start every script with these
my $index = "a";
my $prev = "";
# iterate over all lines (rather than reading all 25E3 into memory at once)
while (<>) {
    # pass through non-header lines
    unless (/^>/) {
        print; # comment this line to remove non-header lines
        next;
    }
    s/\.1\K.*//s; # remove everything after ".1". Implies chomping
    # reset or increment $index
    if ($_ eq $prev) {
        $index++;
    } else {
        $index = "a";
    }
    # update the previous line
    $prev = $_;
    # output new header
    print "$_$index\n";
}
Usage: $ perl script.pl <./Hc_genome/haemonchus_V1.fa >./Hc_genome/modded_haemonchus_V1.fa.
It is considered good style to write programs that accept input from STDIN and write to STDOUT, as this improves flexibility. Rather than hardcoding paths in your perl script, keep your script general, and use shell redirection operators like < to specify the input. This also saves you the hassle of manually opening the files.
Example Output:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a

Dealing with hidden files when making an array of files inside a directory, using Perl

I am using Perl. I am making an array of files inside a directory. Hidden files, ones that begin with a dot, are at the beginning of my array. I want to actually ignore and skip over those, since I do not need them in the array. These are not the files I am looking for.
The solution to the problem seems easy. Just use regular expression to search for and exclude hidden files. Here's my code:
opendir(DIR, $ARGV[0]);
my @files = (readdir(DIR));
closedir(DIR);
print scalar @files."\n"; # used just to help check on how long the array is
for ( my $i = 0; $i < @files; $i++ )
{
    # ^ as an anchor, \. for literal . and second . for match any following character
    if ( $files[ $i ] =~ m/^\../ || $files[ $i ] eq '.' ) #
    {
        print "$files[ $i ] is a hidden file\n";
        print scalar @files."\n";
    }
    else
    {
        print $files[ $i ] . "\n";
    }
} # end of for loop
This produces an array @files and shows me the hidden files I have in the directory. Next step is to remove the hidden files from the array @files. So use the shift function, like this:
opendir(DIR, $ARGV[0]);
my @files = (readdir(DIR));
closedir(DIR);
print scalar @files."\n"; # used just to help check on how long the array is
for ( my $i = 0; $i < @files; $i++ )
{
    # ^ as an anchor, \. for literal . and second . for match any following character
    if ( $files[ $i ] =~ m/^\../ || $files[ $i ] eq '.' ) #
    {
        print "$files[ $i ] is a hidden file\n";
        shift @files;
        print scalar @files."\n";
    }
    else
    {
        print $files[ $i ] . "\n";
    }
} # end of for loop
I get an unexpected result. My expectation is that the script will:
make the array @files,
scan through that array looking for files that begin with a dot,
find a hidden file, tell me it found one, then promptly shift it off the front end of the array @files,
then report to me the size or length of @files,
otherwise, just print the name of the files that I am actually interested in using.
The first script works fine. The second version of the script, the one using the shift function to remove hidden files from @files, does find the first hidden file (. or current directory) and shifts it off. It does not report back to me about .., the parent directory. It also does not find another hidden file that is currently in my directory to test things out. That hidden file is a .DS_Store file. But on the other hand, it does find a hidden .swp file and shifts it out.
I can't account for this. Why does the script work OK for the current directory . but not the parent directory ..? And also, why does the script work OK for a hidden .swp file but not the hidden .DS_Store file?
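After shifting a file, every remaining element moves down one position, so your index $i now points to the following file. The loop then increments $i, and that file is never examined.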
You can use grep to get rid of the files whose names start with a dot, no shifting needed:
my @files = grep ! /^\./, readdir DIR;
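Both points can be reproduced without a real directory. The sketch below uses a hard-coded list (@entries, hypothetical sample data in an arbitrary order) first to show the grep filter, and then to trace which entries the shift-based loop actually looks at. The exact entries that get skipped depend on the order readdir returns them in, which is why the question saw .DS_Store skipped but the .swp file caught.

```perl
use strict;
use warnings;

# Stand-in for a readdir result (sample data, arbitrary order).
my @entries = ('.', '..', '.DS_Store', '.notes.swp', 'a.txt', 'b.txt');

# 1) Filtering with grep: keep only names that do not start with a dot.
my @visible = grep { !/^\./ } @entries;
print "@visible\n";     # prints "a.txt b.txt"

# 2) Tracing the shift-while-iterating loop from the question.
my @files = @entries;   # work on a copy
my @examined;
for (my $i = 0; $i < @files; $i++) {
    push @examined, $files[$i];
    # shift removes the FIRST element, and every index moves down by one
    shift @files if $files[$i] =~ /^\./;
}
print "@examined\n";    # prints ". .DS_Store a.txt b.txt"
print "@files\n";       # prints ".DS_Store .notes.swp a.txt b.txt"
```

In this run '..' and '.notes.swp' are never examined, and two hidden entries are still left in @files at the end, which matches the kind of gaps described in the question.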

Resources