Regexp to compare partial filenames, then move the files to another directory (Perl)

I am working on a script that compares the non-running files in a directory with the running files reported by a command. I use one regex to strip the front half of the filenames from the directory and another to pull the filenames out of the command output, then record the unmatched names in an array.
The part I cannot figure out is how to move those files from the old directory into a new directory for future deletion.
To move the files I will need to wrap the names in wildcards (*), because of the random numbers in front of the filenames and the extension.
example filenames before and after:
within dir:
13209811124300209156562070_cake_872_trucks.rts
within command:
{"file 872","cake_872_trucks.rts",running}
in @events array:
cake_872_trucks
My code:
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Copy qw(move);
use Data::Dumper;
use List::Util 'max';
my $orig_dir = "/var/user/data/";
my $dest_dir = "/var/user/data/DeleteMe/";
my $dir = "/var/user/data";
opendir(DIR, $dir) or die "Could not open $dir: $!\n";
my @allfiles = readdir DIR;
close DIR;
my %files;
foreach my $allfiles(@allfiles) {
$allfiles =~ m/^(13{2}638752056463{2}635181_|1[0-9]{22}_|1[0-9]{23}_|1[0-9]{24}_|1[0-9]{25}_)([0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z][0-9a-z]{3}_[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z]{2}0[0-9]\.rts|[a-z]{2}[0-9a-z][0-9]\N[0-9a-z]\N[0-9]\N[0-9]\N[0-9a-z]{4}\N[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts|S0{2}2_86F_JATD_01ZF\.rts)$/im;
$files{$2} = [$1];
}
my @stripfiles = keys %files;
my $cmd = "*****";
my @runEvents = `$cmd`;
chomp @runEvents;
foreach my $running(@runEvents) {
$running =~ s/^\{"blah 8[0-9a-z]{2}","(?<field2>CBE1_D{3}1_8EC_J6TG0{2}\.rts|[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts)(?:",\{239,20,93,5\},310{2},20{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{2}0{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{3}0{4},run{2}ing\}|",\{239,20,93,5\},3[0-9]0{2},[0-9]{2}0{4},run{2}ing\})$/$+{field2}/img;
}
my @events = grep {my $x = $_; not grep {$x =~/\Q$_/i}@runEvents}@stripfiles;
foreach my $name (@events) {
my ($randnum, $fnames) = { $files{$name}};
my $combined = $randnum . $fnames;
print "Move $file from $orig_dir to $dest_dir";
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
#print scalar(grep $_, @stripfiles), "\n";
#returned 1626
#print scalar(grep $_, @runEvents), "\n";
#returned 102
#print scalar(grep $_, @allfiles), "\n";
#returned 1906

Since you are already parsing the filenames with a regex, you can capture all of their parts and later reconstitute whichever pieces of the filename you need.
I assume that that overly long (and incomplete) regex does what it is meant to.
I am not sure how the files to move relate to the original files in @allfiles, since those are fetched from /var/user/data while your moving attempt uses /home/user/RunBackup. So the code snippets below are more generic.
If what gets moved are precisely the files from @allfiles, then just keep the file name:
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/...(...).../; # your regex, but capture the name
$files{$1} = $oldfile;
}
where /...(...).../ stands for your regex, with parentheses added around the part of the pattern that matches the name itself.
Then you can later retrieve the filename from the "name" of interest (cake_872_trucks).
If, however, the filename components may be needed to build a different (but related) filename, then capture and store the individual components:
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/(...)(...)(...)/; # your regex, just with capture groups
$files{$2} = [$1, $3]; # add to %files: name => [number, ext]
}
The regex only matches (why change names in @allfiles with s///?), and captures.
The first set of parentheses captures that long leading factor (number) into $1, the second one gets the name (cake_872_trucks) into $2, and the third one has the extension, in $3.
So you end up with a hash whose keys are the names of interest and whose values are arrayrefs holding all the other needed components of the filename. Please adjust as needed, as I don't know what that regex does and may have missed some parts.
Now once you go through @events you can rebuild the name:
use feature qw(say); # needed for say
use File::Copy qw(move);
foreach my $name (@events) {
my ($num, $ext) = @{ $files{$name} }; # components stored earlier
my $file = $num . $name . $ext; # rebuild the full filename
say "Move $file from $orig_dir to $dest_dir";
move("$orig_dir/$file", $dest_dir) or warn "Can't move $file: $!";
}
But if the files to move are indeed from @allfiles (as would be the case in this example), then use the first version above to store filenames as values in %files and now retrieve them:
foreach my $name (@events) {
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $files{$name}: $!";
}
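As a rough, self-contained sketch of the capture-and-rebuild idea, here is a version that runs on the sample filename from the question. The simplified pattern `^(\d+_)(.+)(\.rts)$` is only an assumed stand-in for the real regex, and the contents of `@allfiles` and `@events` are made up for illustration. Nothing is moved; the script only prints what it would move.

```perl
use strict;
use warnings;

# Hypothetical directory listing; in the real script this comes from readdir.
my @allfiles = (
    '13209811124300209156562070_cake_872_trucks.rts',
    '13209811124300209156562071_pie_873_vans.rts',
    'notes.txt',
);

my %files;
foreach my $oldfile (@allfiles) {
    # Assumed stand-in pattern: (number prefix)(name)(extension)
    next unless $oldfile =~ m/^(\d+_)(.+)(\.rts)$/;
    $files{$2} = [$1, $3];    # name => [number, ext]
}

# Hypothetical list of names that are not running.
my @events = ('cake_872_trucks');

my @to_move;
foreach my $name (@events) {
    my ($num, $ext) = @{ $files{$name} };
    push @to_move, $num . $name . $ext;    # rebuild the full filename
}

print "Would move: $_\n" for @to_move;
```

Skipping non-matching entries with `next unless` also keeps `.` and `..` (and any unrelated files) out of the hash.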
I use the core module File::Copy, instead of going out to the system for the move command.
You can also rebuild the name by going through the directory again, now with names of interest on hand. But that'd be very expensive since you have to try to match every name in @events for every file read in the directory (O(mn) complexity).
What you asked about can be accomplished with glob (and note File::Glob's version)
my @files = glob "$dir/*${name}*";
but you'd have to do this for every $name -- a huge and needless waste of resources.
If that regex really must spell out specific numbers, here is a way to organize it for easier digestion (and debugging!): break it into reasonable parts, with a separate variable for each.
Ideally each part of alternation would be one variable
my $p1 = qr/.../;
my $p2 = qr/.../;
...
my $re_alt = join '|', $p1, $p2, ...;
my $re_other = qr/.../;
$var =~ m/^($re_alt)($re_other)(.*)$/; # adjust anchors, captures, etc
where the qr operator builds a regex pattern.
Adjust those capturing parentheses, anchors, etc. to your actual needs. Splitting the regex sensibly into variables will go a long way toward readability, and thus correctness.
Assuming that there is a good reason to seek those specific numbers in filenames, this is also a good way to document any such fixed factors.
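A small sketch of that layout, under assumptions: the sub-patterns below (`$p_short`, `$p_long`, `$re_name`, `$re_ext`) are hypothetical simplifications, not the patterns from the question. They only show how qr pieces combine into one match.

```perl
use strict;
use warnings;

# Hypothetical prefix forms, one variable per alternative.
my $p_short   = qr/1[0-9]{22}_/;       # "1" followed by 22 digits
my $p_long    = qr/1[0-9]{23,25}_/;    # "1" followed by 23 to 25 digits
my $re_prefix = join '|', $p_short, $p_long;

# Hypothetical shapes for the name and the extension.
my $re_name = qr/[a-z]+_[0-9]+_[a-z]+/;
my $re_ext  = qr/\.rts/;

my $file = '13209811124300209156562070_cake_872_trucks.rts';
my ($prefix, $name, $ext) = $file =~ m/^($re_prefix)($re_name)($re_ext)$/;

if (defined $name) {
    print "prefix=$prefix name=$name ext=$ext\n";
}
else {
    print "no match\n";
}
```

Each variable can carry a comment explaining why that particular form is accepted, which is hard to do inside one long pattern.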

I guess you need something like this:
my $path = '/home/user/RunBackup/';
my @files = map {$path."*$_*"} @events;
system(join " ", "mv", @files, "/home/user/RunBackup/files/");
If there are lots of files you might need to move them one by one:
system(join " ", "mv", $_, "/home/user/RunBackup/files/") for @files;

Related

Removing file extension from an array variable

I'm trying to remove the .png file extension that appears in many (but not all) of the variables of my outputted array. The array variables that show the extension are doing so because they weren't generated from file names in the format of "Genus_species#.png" where "#" is a number. Rather, they were generated from an un-numbered file name in the format of "Genus_species.png". I believe this line of code is creating this issue: "$genus = $file =~ s/\d.png$//r;". How do I resolve this? Please advise.
Here's my Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
next unless $file =~ /.png$/; ## entry must have .png extension
my $genus = $file =~ s/\d\.png$//r;
push(@{$genus_species{$genus}}, $file); ## push onto the array; @{} dereferences the hash entry as an array
}
for my $genus (keys %genus_species)
{
print "$genus = ";
print "$_, " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
print "\n";
}
Here's the outputted array:
Euonymus_fortunei = Euonymus_fortunei1.png, Euonymus_fortunei2.png, Euonymus_fortunei3.png,
Polygonum_persicaria = Polygonum_persicaria1.png, Polygonum_persicaria2.png,
Polygonum_cuspidatum.png = Polygonum_cuspidatum.png,
Notice that the variable "Polygonum_cuspidatum.png" still includes the file extension, which I don't want, because it was generated from a file that lacked a number in its name. Specifically, this variable should read:
Polygonum_cuspidatum = Polygonum_cuspidatum.png
Again, please advise how to resolve this issue. Thanks.
You're going to see the same issue if you ever have a multi-digit number in a filename. This is all due to the choice of regular expression:
s/\d\.png$//r
This looks for exactly one digit followed by .png. If you want no digit, or any number of digits before .png modify your regular expression as such:
s/\d*\.png$//r
That says "zero or more digits followed by .png at the end of the string".
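To see the difference, here is a minimal sketch that applies the `\d*` version to a few file names. The list in `@files` is hypothetical (it includes one multi-digit name on purpose), and the grouping mirrors the hash-of-arrays approach from the question.

```perl
use strict;
use warnings;

# Hypothetical file names: one digit, several digits, and no digit.
my @files = qw(
    Euonymus_fortunei1.png
    Euonymus_fortunei12.png
    Polygonum_cuspidatum.png
);

my %genus_species;
for my $file (@files) {
    my $genus = $file =~ s/\d*\.png$//r;    # /r needs Perl 5.14 or later
    push @{ $genus_species{$genus} }, $file;
}

for my $genus (sort keys %genus_species) {
    print "$genus = ", join(', ', sort @{ $genus_species{$genus} }), "\n";
}
```

Using join for the output also avoids the trailing comma that the print-in-a-loop version leaves behind.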

Creating array of file names using grep

I'm having difficulty outputting file names as an array using grep. Specifically, I want to create an array of file names (plant photos) formatted like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Here's my original Perl script that I'm attempting to modify. It returns, as intended, ONE file name per plant as "Genus_species", printing a list of those plants:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
map { s/1\.png\z/.png/r } # Removes "1" from end of file names
grep { /^[^2-9]*\.png\z/i && /_/ } # Finds "Genus_species.png" & "Genus_species1.png" and returns one file name per plant as "Genus_species.png"
readdir $dfh;
foreach my $file (@files) {
$file =~s/\.png//; # Removes ".png" extension
print "$file\n"; #Prints list of file names (plant names)
}
Here's the output:
Ilex_verticillata
Asarum_canadense
Ageratina_altissima
However, since each plant often has MULTIPLE photos (e.g. "Genus_species1.png", "Genus_species2.png", etc.), I need to re-grep the directory using the above output to find their file names, then output the results in the form of an array as previously illustrated.
I know the solution likely involves modifying the "foreach" statement, using grep to return ALL file names with "Genus_species" in their name. Here's what I tried:
foreach my $file (@files) {
$file =~s/\.png//;
grep ($file,readdir(DIR));
print "$file = $file\n";
But the output was this:
Ilex_verticillata = Ilex_verticillata
Asarum_canadense = Asarum_canadense
Ageratina_altissima = Ageratina_altissima
Again, I want to output an array formatted as:
"Genus_species= Genus_species1.png, Genus_species2.png, etc.," meaning I want it to look like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Notice that I also want to add back the ".png" extension ONLY to the file names to the right of the equals sign.
Please advise. Thanks.
readdir returns a list of the entries in the directory. You've chained the processing into a single statement, which is compact. However, if you loop over the entries you can process each one further.
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
next unless $file =~ /\d\.png$/; ## skip entry if not a png file ending with a number
my $genus = $file =~ s/\d\.png$//r;
push(@{$genus_species{$genus}}, $file); ## push onto the array; @{} dereferences the hash entry as an array
}
for my $genus (keys %genus_species)
{
print "$genus = ";
print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
print "\n";
}

Perl regex capture grouping

Is there a better way to write the below code using a regex capture grouping?
I'm looking to get the folder name immediately after Recordings.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $path1 = '\\server1\Recordings\AU-AR-Online\Outbound\20160120\52031\52031-14-07-16.wav';
my $path2 = '\\server2\e$\Recordings\SU-AC-Online\Outbound\20160120\52031\52031-14-07-16.wav';
my @paths = ( $path1,$path2 );
foreach my $path (@paths) {
# Split path into fields
my @array = (split /\\/, $path);
# Get index of Recordings
my( $index )= grep { $array[$_] eq "Recordings" } 0..$#array;
# Brand always follows Recordings
print $array[$index+1];
}
Sure, just replace the content of your loop with this:
my ($brand) = $path =~ m|\\Recordings\\([^\\]+)| or die "Not found";
print $brand;
Capture the folder directly after Recordings\
my ($brand) = $path =~ m{ Recordings \\ ( [^\\]+ ) }x;
Using the x modifier on the regex means whitespace in the pattern is ignored, which can help make the regex more readable.
If the brand folder is always 5th-last, you could split the path and grab it by negative indexing.
my $brand = (split /\\/, $path)[-5];
But again, that only works if the brand is always 5th last. I don't know what your data set is.
Also, if you're working with paths, there are many modules (such as Path::Tiny) that make it easier to get parent/child/absolute paths, basenames, etc.
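For reference, a minimal sketch that runs the capture over both sample paths from the question. The `@brands` array and the "(not found)" fallback are additions for illustration only.

```perl
use strict;
use warnings;

my @paths = (
    '\\server1\Recordings\AU-AR-Online\Outbound\20160120\52031\52031-14-07-16.wav',
    '\\server2\e$\Recordings\SU-AC-Online\Outbound\20160120\52031\52031-14-07-16.wav',
);

my @brands;
for my $path (@paths) {
    # /x allows spacing in the pattern; capture the folder right after "Recordings\"
    my ($brand) = $path =~ m{ \\ Recordings \\ ( [^\\]+ ) }x;
    push @brands, defined $brand ? $brand : '(not found)';
}

print "$_\n" for @brands;
```

The list-context assignment `my ($brand) = ...` is what returns the captured group; in scalar context the match would only return true or false.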

Call upon specific elements from array

Ok, so I have a bunch of file names possessing one of the following two formats:
Sample-ID_Adapter-Sequence_L001_R1_001.fastq (As Forward)
Sample-ID_Adapter-Sequence_L001_R2_001.fastq (As Reverse)
The only difference between the forward and reverse formats is the R1 and R2 elements in the filename. Now, I've managed to enable the user to provide the directory containing these files with the following script:
#!/usr/bin/perl
use strict;
use warnings;
#Print Directory
print "Please provide the directory containing the FASTQ files from your Illumina MiSeq run \n";
my $FASTQ = <STDIN>;
chomp ($FASTQ);
#Open Directory
my $dir = $FASTQ;
opendir(DIR, $dir) or die "Cannot open $dir: $!";
my @forwardreads = grep { /R1_001.fastq/ } readdir DIR;
closedir DIR;
my $direct = $FASTQ;
opendir(DIR, $direct) or die "Cannot open $dir: $!";
my @reversereads = grep { /R2_001.fastq/ } readdir DIR;
closedir DIR;
foreach my $ffile (@forwardreads) {
my $forward = $ffile;
print $forward;
}
foreach my $rfile (@reversereads) {
my $reverse = $rfile;
print $reverse;
}
The Problem
What I want to do with the above script is to find a way to pair up the elements of both arrays that are derived from the same Sample ID. Like I said, the only difference between the forward and reverse files (from the same sample ID) would be the R1 and the R2 parts of the file name.
I've tried looking up ways to extract elements from an array, but I want to let the program do the matching instead of me.
Thanks for reading and I hope you guys can help!
You'll have to parse out the filename. Fortunately, this is pretty straightforward. After stripping the extension, you can split the pieces on _.
# Strip the file extension. s/// returns a success flag, not the capture, so read $1 afterwards.
my $suffix = $filename =~ s{\.([^.]+)$}{} ? $1 : '';
# Parse Sample-ID_Adapter-Sequence_L001_R1_001
my($sample_id, $adapter_sequence, $uhh, $format, $yeah) = split /_/, $filename;
Now you can do what you like with them.
I'd suggest a few things to improve the code. First, put that filename parsing into a function so it can be reused and to keep the main code simpler. Second, parse the filenames into a hash rather than a bunch of scalars, it'll be easier to work with and pass around. Finally, include the filename itself in that hash, then the hash contains the complete data. This, btw, is a gateway drug to OO programming.
sub parse_fastq_filename {
# Read the next (in this case first and only) argument.
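my $filename = shift;
# Strip the directory and the suffix from a copy, keeping the original filename intact
(my $name = $filename) =~ s{^.*/}{};
my $suffix = $name =~ s{\.([^.]+)$}{} ? $1 : '';
# Parse Sample-ID_Adapter-Sequence_L001_R1_001
my($sample_id, $adapter_sequence, $uhh, $format, $yeah) = split /_/, $name;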
return {
filename => $filename,
sample_id => $sample_id,
adapter_sequence => $adapter_sequence,
uhh => $uhh,
format => $format,
yeah => $yeah
};
}
Then, rather than finding forward (R1) and reverse (R2) files separately, process everything in one loop. Put matching forward and reverse pairs in a hash. Use glob to pick up only the .fastq files.
# This is where the pairs of files will be stored.
my %pairs;
# List just the *.fastq files
while( my $filename = glob("$FASTQ_DIR/*.fastq")) {
# Parse the filename into a hash reference
my $fastq = parse_fastq_filename($filename);
# Put each parsed fastq filename into its pair
$pairs{ $fastq->{sample_id} }{ $fastq->{format} } = $fastq;
}
Then you can do what you like with %pairs. Here's an example to print out each sample ID and what formats it has.
# Iterate through each sample and pair.
# $sample is a hash ref of format pairs
for my $sample (values %pairs) {
# Now iterate through each pair in the sample
for my $fastq (values %$sample) {
say "$fastq->{sample_id} has format $fastq->{format}";
}
}
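Here is a compact, self-contained sketch of the pairing idea that runs without touching the disk. The names in `@filenames` are hypothetical, and it assumes sample IDs contain no underscores, since the name is split on `_`. It stores plain filenames rather than the parsed hashes used above.

```perl
use strict;
use warnings;

# Hypothetical file names following Sample-ID_Adapter-Sequence_L001_R?_001.fastq
my @filenames = qw(
    S1_ACGT_L001_R1_001.fastq
    S1_ACGT_L001_R2_001.fastq
    S2_TTGA_L001_R1_001.fastq
);

my %pairs;
for my $filename (@filenames) {
    (my $name = $filename) =~ s/\.fastq$//;    # drop the extension from a copy
    my ($sample_id, $adapter, $lane, $read) = split /_/, $name;
    $pairs{$sample_id}{$read} = $filename;     # sample => { R1 => ..., R2 => ... }
}

for my $sample_id (sort keys %pairs) {
    my $pair = $pairs{$sample_id};
    if ($pair->{R1} && $pair->{R2}) {
        print "$sample_id: $pair->{R1} <-> $pair->{R2}\n";
    }
    else {
        print "$sample_id: incomplete pair\n";
    }
}
```

Keying on the sample ID first means a missing forward or reverse file shows up as an incomplete entry instead of silently shifting the pairing.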

Perl, matching files of a directory, using an array with part of the these file names

So, I have this directory with files named like this:
HG00119.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam.bai
HG00119.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam_herc2_data.bam
HG00117.mapped.illumina.mosaik.GBR.exome.20110411.bam_herc2_phase1.bam
HG00117.mapped.illumina.mosaik.GBR.exome.20110411.bam.bai
NA20828.mapped.illumina.mosaik.TSI.exome.20110411.bam_herc2_phase1.bam
NA20828.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.bam_herc2_data.bam
And I have a input.txt file that contains in each line.
NA20828
HG00119
As you can see, the input.txt file has the beginning of the name of the files inside the directory.
What I want to do is to filter the files in the directory that have the name (in this case just the beginning), inside the input.txt.
I don't know if I was clear, but here is the code I've done so far.
use strict;
use warnings;
my @lines;
my @files = glob("*.mapped*");
open (my $input,'<','input.txt') or die $!;
while (my $line = <$input>) {
push (@lines, $line);
}
close $input;
I used the glob to filter only the files with mapped in the name, since I have other files there that I don't want to look for.
I tried some foreach loops, tried grep and regex also, and I'm pretty sure that I was going in the right way, and I think my mistake might be about scope.
I would appreciate any help guys! thanks!
OK, first off - your while loop is redundant. If you read from a filehandle in a list context, it reads the whole thing.
my @lines = <$input>;
will do the same as your while loop.
Now, for your patterns - you're matching one list against another list, but partial matches.
chomp ( @lines );
foreach my $file ( @files ) {
foreach my $line ( @lines ) {
if ( $file =~ m/$line/ ) { print "$file matches $line"; }
}
}
(And yes, something like grep or map can do this, but I always find those two make my head hurt - they're neater, but they're implicitly looping so you don't really gain much algorithmic efficiency).
You can build a regular expression from the contents of input.txt like this
my @lines = do {
open my $fh, '<', 'input.txt' or die $!;
<$fh>;
};
chomp @lines;
my $re = join '|', @lines;
and then find the required files using
my @files = grep /^(?:$re)/, glob '*.mapped*';
Note that, if the list in input.txt contains any regex metacharacters, such as ., *, + etc. you will need to escape them, probably using quotemeta like this
my $re = join '|', map quotemeta, @lines;
and it may be best to do this anyway unless you are certain that there will never ever be such characters in the file.
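A short sketch of why the escaping matters. The prefix "HG001.9" is a made-up entry that contains a metacharacter on purpose, and the file names in `@names` are an in-memory stand-in for the glob result.

```perl
use strict;
use warnings;

# Hypothetical prefixes; "HG001.9" contains a regex metacharacter on purpose.
my @lines = ('NA20828', 'HG001.9');

# Stand-in for the glob result.
my @names = (
    'NA20828.mapped.illumina.mosaik.TSI.exome.20110411.bam.bai',
    'HG00119.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam.bai',
    'HG00117.mapped.illumina.mosaik.GBR.exome.20110411.bam.bai',
);

my $raw    = join '|', @lines;
my $quoted = join '|', map { quotemeta } @lines;

my @loose  = grep { /^(?:$raw)/ }    @names;    # "." matches any character
my @strict = grep { /^(?:$quoted)/ } @names;    # "." must be a literal dot

print "unescaped: ", scalar(@loose),  " match(es)\n";
print "escaped:   ", scalar(@strict), " match(es)\n";
```

Without quotemeta, the unescaped dot lets "HG001.9" match "HG00119", which is probably not what the input file intended.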

Resources