Creating array of file names using grep - arrays

I'm having difficulty outputting file names as an array using grep. Specifically, I want to create an array of file names (plant photos) formatted like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Here's my original Perl script that I'm attempting to modify. It returns, as intended, ONE file name per plant as "Genus_species", printing a list of those plants:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
    map { s/1\.png\z/.png/r } # Removes "1" from end of file names
    grep { /^[^2-9]*\.png\z/i && /_/ } # Finds "Genus_species.png" & "Genus_species1.png" and returns one file name per plant as "Genus_species.png"
    readdir $dfh;
foreach my $file (@files) {
    $file =~ s/\.png//; # Removes ".png" extension
    print "$file\n"; # Prints list of file names (plant names)
}
Here's the output:
Ilex_verticillata
Asarum_canadense
Ageratina_altissima
However, since each plant often has MULTIPLE photos (e.g., "Genus_species1.png", "Genus_species2.png", etc.), I need to re-grep the directory using the above output to find their file names, then output the results in the form of an array as previously illustrated.
I know the solution likely involves modifying the "foreach" statement, using grep to return ALL file names with "Genus_species" in their name. Here's what I tried:
foreach my $file (@files) {
    $file =~ s/\.png//;
    grep ($file,readdir(DIR));
    print "$file = $file\n";
}
But the output was this:
Ilex_verticillata = Ilex_verticillata
Asarum_canadense = Asarum_canadense
Ageratina_altissima = Ageratina_altissima
Again, I want to output an array formatted as:
"Genus_species= Genus_species1.png, Genus_species2.png, etc.," meaning I want it to look like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Notice that I also want to add back the ".png" extension ONLY to the file names to the right of the equals sign.
Please advise. Thanks.

readdir returns a list of the entries in the folder. You've chained everything into one statement, which is compact. If you loop over the entries instead, you can process each one further, for example by grouping the file names per plant in a hash.
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\d\.png$/; ## skip entry if not a png file ending with a number
    my $genus = $file =~ s/\d\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
    print "\n";
}
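To get the comma-separated layout shown in the question, the inner print loop can be swapped for join. This is only a sketch: the helper name group_by_plant and the sample file names are made up for illustration, and the list of names stands in for the result of readdir.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group numbered .png names under their plant name.
sub group_by_plant {
    my @names = @_;
    my %photos;
    for my $file (@names) {
        next unless $file =~ /\d\.png$/;       # same filter as the loop above
        my $plant = $file =~ s/\d\.png$//r;    # strip trailing digit and extension
        push @{ $photos{$plant} }, $file;
    }
    return \%photos;
}

# Sample names standing in for the result of readdir.
my $photos = group_by_plant(qw(
    Ilex_verticillata2.png Ilex_verticillata1.png
    Asarum_canadense1.png  notes.txt
));

for my $plant (sort keys %$photos) {
    # join puts ", " between entries only, so there is no trailing separator
    print "$plant= ", join(', ', sort @{ $photos->{$plant} }), "\n";
}
```

Because join only inserts the separator between elements, the line does not end in a stray comma or space.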


Removing file extension from an array variable

I'm trying to remove the .png file extension that appears in many (but not all) of the variables of my outputted array. The array variables that show the extension are doing so because they weren't generated from file names in the format of "Genus_species#.png" where "#" is a number. Rather, they were generated from an un-numbered file name in the format of "Genus_species.png". I believe this line of code is creating this issue: "$genus = $file =~ s/\d.png$//r;". How do I resolve this? Please advise.
Here's my Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\.png$/; ## entry must have .png extension
    my $genus = $file =~ s/\d\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_, " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
    print "\n";
}
Here's the outputted array:
Euonymus_fortunei = Euonymus_fortunei1.png, Euonymus_fortunei2.png, Euonymus_fortunei3.png,
Polygonum_persicaria = Polygonum_persicaria1.png, Polygonum_persicaria2.png,
Polygonum_cuspidatum.png = Polygonum_cuspidatum.png,
Notice that the variable "Polygonum_cuspidatum.png" unintentionally includes the file extension because it was generated from a file that lacked a number in its name. Specifically, this variable should read:
Polygonum_cuspidatum = Polygonum_cuspidatum.png
Again, please advise how to resolve this issue. Thanks.
You're going to see the same issue if you ever have a multi-digit number in a filename. This is all due to the choice of regular expression:
s/\d\.png$//r
This looks for exactly one digit followed by .png. If you want no digit, or any number of digits before .png modify your regular expression as such:
s/\d*\.png$//r
That says "zero or more digits followed by .png at the end of the string".
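A quick way to check the difference is to run the substitution over a few sample names. The helper name plant_name and the sample file names below are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical helper: strip any trailing digits plus ".png" from a name.
sub plant_name {
    my ($file) = @_;
    return $file =~ s/\d*\.png$//r;   # /r returns the changed copy, leaving $file alone
}

# Covers the three cases: no digit, one digit, several digits.
for my $file (qw(Polygonum_cuspidatum.png Euonymus_fortunei3.png Ilex_verticillata12.png)) {
    printf "%-28s -> %s\n", $file, plant_name($file);
}
```

All three names reduce to the bare plant name, so numbered and un-numbered photos of the same plant end up under the same hash key.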

Creating a JSON array containing multiple variables

I want to modify my Perl script to output a list of variables using the json_encode function, but I'm not sure how.
Here's the output of my unmodified Perl script:
Vicia_sativa = Vicia_sativa.png
Geranium_maculatum = Geranium_maculatum.png
Narcissus_pseudonarcissus = Narcissus_pseudonarcissus1.png Narcissus_pseudonarcissus2.png
Polygonum_persicaria = Polygonum_persicaria1.png Polygonum_persicaria2.png
Corylus_americana = Corylus_americana1.png Corylus_americana2.png
The variables to the left of the equal signs are plant names, and the one or more file names to the right of the equal signs are plant photos. Notice that these file names are not separated by commas.
Here's my Perl script that generated the above output:
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\.png$/; ## entry must have .png extension
    my $genus = $file =~ s/\d*\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
    print "\n";
}
Please advise how to output these variables in a JSON array. Thanks.
Update... Here's the revised script with the recommended changes per a forum member:
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\.png$/; ## entry must have .png extension
    my $genus = $file =~ s/\d*\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
print(encode_json(\%genus_species));
This revised code works! However, the file names are no longer sorted. Any ideas how to incorporate sort into the encode_json function?
Pretty straightforward, actually: use the JSON::PP module and pass a reference to your hash to encode_json().
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
# your other code goes here...
# instead of `for my $genus (keys %genus_species) { ... }` do:
print(encode_json(\%genus_species));
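For a self-contained illustration, here is a sketch with a hard-coded hash standing in for the one built from readdir (the sample data is copied from the question's output). It also decodes the result again to show that the structure survives the round trip.

```perl
use strict;
use warnings;
use JSON::PP;

# Sample data standing in for %genus_species built from the directory listing.
my %genus_species = (
    Vicia_sativa         => ['Vicia_sativa.png'],
    Polygonum_persicaria => ['Polygonum_persicaria1.png', 'Polygonum_persicaria2.png'],
);

# encode_json takes a reference and returns a UTF-8 encoded JSON string.
my $json = encode_json(\%genus_species);
print "$json\n";

# Round trip: decode_json turns the text back into a Perl data structure.
my $back = decode_json($json);
print scalar(@{ $back->{Polygonum_persicaria} }), " photos for Polygonum_persicaria\n";
```

Note that plain encode_json makes no promise about the order of the hash keys in the output, so the two plants may appear in either order.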

Need json_encode to return sorted array

I need json_encode to return a sorted array, but can't figure out how.
Here's my Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\.png$/; ## entry must have .png extension
    my $genus = $file =~ s/\d*\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
print "var galleryarray = "; ## start of a JavaScript variable declaration
print (encode_json(\%genus_species)); ## JSON object mapping each plant name to its image file names
Here's part of the output, showing the unsorted elements:
var galleryarray = {"Polygonum_pensylvanicum":["Polygonum_pensylvanicum2.png","Polygonum_pensylvanicum3.png","Polygonum_pensylvanicum1.png"]
Notice that the indexed file names are unsorted numerically.
Before posting, I tried adding the following sort function below the push function:
sort(@{$genus_species{$genus}}, $file);
Unfortunately, that caused an error; specifically "Useless use of sort in void context."
Please advise how I can output a sorted array using json_encode. Thanks.
The following is the proper syntax:
@{$genus_species{$genus}} = sort @{$genus_species{$genus}};
It's inefficient to repeatedly sort the arrays. Instead, you should create a second loop after the first.
for my $genus (keys(%genus_species)) {
    @{$genus_species{$genus}} = sort @{$genus_species{$genus}};
}
or
@{$genus_species{$_}} = sort @{$genus_species{$_}}
    for keys(%genus_species);
But your strings have a mix of text and numbers, so a natural sort would be better.
use Sort::Key::Natural qw( natsort );
@{$genus_species{$_}} = natsort @{$genus_species{$_}}
    for keys(%genus_species);
If you want the keys to be sorted too, replace
encode_json(...)
with
JSON::PP->new->utf8->canonical->encode(...)
or the faster
Cpanel::JSON::XS->new->utf8->canonical->encode(...)
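If Sort::Key::Natural is not installed, one core-only alternative is to compare the trailing photo number numerically. This is a sketch rather than a drop-in replacement for natsort: the helpers photo_number and sort_photos are hypothetical, and it assumes names of the form Genus_species<digits>.png.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: extract the number before ".png" (0 if there is none).
sub photo_number {
    my ($file) = @_;
    return $file =~ /(\d+)\.png$/ ? $1 : 0;
}

# Sort file names by their trailing number, falling back to string order on ties.
sub sort_photos {
    return sort { photo_number($a) <=> photo_number($b) or $a cmp $b } @_;
}

# Sample data standing in for the hash built from the directory listing.
my %genus_species = (
    Polygonum_pensylvanicum => [qw(
        Polygonum_pensylvanicum10.png
        Polygonum_pensylvanicum2.png
        Polygonum_pensylvanicum1.png
    )],
    Asarum_canadense => ['Asarum_canadense.png'],
);

# Sort each array once, after all files have been collected.
@{ $genus_species{$_} } = sort_photos(@{ $genus_species{$_} }) for keys %genus_species;

# canonical makes the encoder emit hash keys in sorted order; pretty is for readability.
print JSON::PP->new->utf8->canonical->pretty->encode(\%genus_species);
```

A plain string sort would put "...10.png" before "...2.png"; comparing the captured numbers with <=> avoids that.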
---UPDATE--- One of our forum members provided a sort expression that allows the json_encode function to return a sorted array. I want to share his solution to help others.
Here's my revised script that returns a sorted array:
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\.png$/; ## entry must have .png extension
    my $genus = $file =~ s/\d*\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## @{} dereferences the hash value as an array; it is created on the first push
}
@{$genus_species{$_}} = sort @{$genus_species{$_}}
    for keys(%genus_species);
print (JSON::PP->new->utf8->canonical->encode(\%genus_species)); ## JSON object with sorted keys, mapping each plant name to its sorted image file names

Regexp to Compare partial filenames then moving to another directory perl

I am working on a script to compare non-running files within a dir to running files from a command. I have to use Regex to strip the front half of the filenames from the dir then regex to strip the filenames out of a command which then records the unmatched names into an array.
The part I cannot figure out is how I can move the filenames from the old dir into a new directory for future deletion.
In order to move the files I will need to enclose the names in wildcards (*), because of the random numbers in front of the filenames and the extension after them.
example filenames before and after:
within dir:
13209811124300209156562070_cake_872_trucks.rts
within command:
{"file 872","cake_872_trucks.rts",running}
in @events array:
cake_872_trucks
My code:
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Copy qw(move);
use Data::Dumper;
use List::Util 'max';
my $orig_dir = "/var/user/data/";
my $dest_dir = "/var/user/data/DeleteMe/";
my $dir = "/var/user/data";
opendir(DIR, $dir) or die "Could not open $dir: $!\n";
my @allfiles = readdir DIR;
closedir DIR;
my %files;
foreach my $allfiles (@allfiles) {
$allfiles =~ m/^(13{2}638752056463{2}635181_|1[0-9]{22}_|1[0-9]{23}_|1[0-9]{24}_|1[0-9]{25}_)([0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z][0-9a-z]{3}_[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z]{2}0[0-9]\.rts|[a-z]{2}[0-9a-z][0-9]\N[0-9a-z]\N[0-9]\N[0-9]\N[0-9a-z]{4}\N[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts|S0{2}2_86F_JATD_01ZF\.rts)$/im;
$files{$2} = [$1];
}
my @stripfiles = keys %files;
my $cmd = "*****";
my @runEvents = `$cmd`;
chomp @runEvents;
foreach my $running (@runEvents) {
$running =~ s/^\{"blah 8[0-9a-z]{2}","(?<field2>CBE1_D{3}1_8EC_J6TG0{2}\.rts|[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts)(?:",\{239,20,93,5\},310{2},20{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{2}0{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{3}0{4},run{2}ing\}|",\{239,20,93,5\},3[0-9]0{2},[0-9]{2}0{4},run{2}ing\})$/$+{field2}/img;
}
my @events = grep { my $x = $_; not grep { $x =~ /\Q$_/i } @runEvents } @stripfiles;
foreach my $name (@events) {
my ($randnum, $fnames) = { $files{$name}};
my $combined = $randnum . $fnames;
print "Move $file from $orig_dir to $dest_dir";
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
#print scalar(grep $_, @stripfiles), "\n";
#returned 1626
#print scalar(grep $_, @runEvents), "\n";
#returned 102
#print scalar(grep $_, @allfiles), "\n";
#returned 1906
Once you are parsing filenames with regex there is no reason not to be able to capture all parts so that you can later reconstitute needed parts of the filename.
I assume that that overly long (and incomplete) regex does what it is meant to.
I am not sure how the files to move relate to the original files in @allfiles, since those are fetched from /var/user/data while your moving attempt uses /home/user/RunBackup. So the code snippets below are more generic.
If what gets moved are precisely the files from @allfiles then just keep the file name
my %files;
foreach my $oldfile (@allfiles) {
    $oldfile =~ m/...(...).../; # your regex, but capture the name
    $files{$1} = $oldfile;
}
where by /...(...).../ I mean to indicate that you use your regex, but add parentheses around the part of the pattern that matches the name itself.
Then you can later retrieve the filename from the "name" of interest (cake_872_trucks).
If, however, the filename components may be needed to patch a different (while related) filename then capture and store the individual components
my %files;
foreach my $oldfile (@allfiles) {
    $oldfile =~ m/(...)(...)(...)/; # your regex, just with capture groups
    $files{$2} = [$1, $3]; # add to %files: name => [number, ext]
}
The regex only matches (why change names in @allfiles with s///?), and captures.
The first set of parentheses captures that long leading factor (number) into $1, the second one gets the name (cake_872_trucks) into $2, and the third one has the extension, in $3.
So you end up with a hash with keys that are names of interest, with their values being arrayrefs with all other needed components of the filename. Please adjust as needed as I don't know what that regex does and may have missed some parts.
Now once you go through @events you can rebuild the name
use feature 'say';
use File::Copy qw(move);
foreach my $name (@events) {
    my ($num, $ext) = @{ $files{$name} };
    my $file = $num . $name . $ext;
    say "Move $file from $orig_dir to $dest_dir";
    move("$orig_dir/$file", $dest_dir) or warn "Can't move $file: $!";
}
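The capture-and-rebuild idea can be tried out without touching the file system. The sketch below uses a deliberately simplified pattern (digits and an underscore, then the name, then ".rts"), which is an assumption and not the question's actual regex; the helper name index_files is also made up. It additionally skips entries that fail to match, so stale capture variables from an earlier match are never stored.

```perl
use strict;
use warnings;

# Assumed, simplified pattern: leading digits and underscore, the name, then ".rts".
# The real pattern in the question is far more specific.
my $re = qr/^(\d+_)(.+)(\.rts)$/;

# Hypothetical helper: map each name to its [number prefix, extension] pair.
sub index_files {
    my %files;
    for my $file (@_) {
        next unless $file =~ $re;   # skip ".", "..", and anything else that doesn't match
        $files{$2} = [$1, $3];
    }
    return \%files;
}

# Sample entries standing in for the result of readdir.
my $files = index_files('.', '..', '13209811124300209156562070_cake_872_trucks.rts');

for my $name (sort keys %$files) {
    my ($num, $ext) = @{ $files->{$name} };
    print "$name -> $num$name$ext\n";   # rebuilt on-disk file name
}
```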
But if the files to move are indeed from @allfiles (as would be the case in this example) then use the first version above to store filenames as values in %files and now retrieve them
foreach my $name (@events) {
    move("$orig_dir/$files{$name}", $dest_dir)
        or warn "Can't move $files{$name}: $!";
}
I use the core module File::Copy, instead of going out to the system for the move command.
You can also rebuild the name by going through the directory again, now with names of interest on hand. But that'd be very expensive since you have to try to match every name in @events for every file read in the directory (O(mn) complexity).
What you asked about can be accomplished with glob (and note File::Glob's version)
my @files = glob "$dir/*${name}*";
but you'd have to do this for every $name -- a huge and needless waste of resources.
If that regex really must spell out specific numbers, here is a way to organize it for easier digestion (and debugging!): break it into reasonable parts, with a separate variable for each.
Ideally each part of alternation would be one variable
my $p1 = qr/.../;
my $p2 = qr/.../;
...
my $re_alt = join '|', $p1, $p2, ...;
my $re_other = qr/.../;
$var =~ m/^($re_alt)($re_other)(.*)$/; # adjust anchors, captures, etc
where the qr operator builds a regex pattern.
Adjust those capturing parentheses, anchors, etc. to your actual needs. Breaking it up so that the regex is sensibly split into variables will go a long way for readability, and thus correctness.
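As a concrete, runnable version of that outline, here is a sketch with made-up sub-patterns. The variable names follow the outline, but the patterns themselves (including the "999_" special case) are illustrative guesses and not the ones from the question.

```perl
use strict;
use warnings;

# Made-up sub-patterns for illustration; the real ones come from the question's regex.
my $p1 = qr/1[0-9]{22,25}_/;                      # long leading number starting with 1
my $p2 = qr/999_/;                                # some fixed special-case prefix
my $re_alt  = join '|', $p1, $p2;                 # alternation of the prefix patterns
my $re_name = qr/[a-z]+_8[0-9a-z]{2}_[a-z]+/i;    # name part, e.g. cake_872_trucks

# Assemble the pieces; each capture group maps to one filename component.
my $re = qr/^($re_alt)($re_name)(\.rts)$/;

for my $file ('13209811124300209156562070_cake_872_trucks.rts', 'readme.txt') {
    if (my ($num, $name, $ext) = $file =~ $re) {
        print "number=$num name=$name ext=$ext\n";
    }
    else {
        print "no match: $file\n";
    }
}
```

Each qr// piece keeps its own flags when interpolated, so the /i on $re_name applies only to the name part.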
Assuming that there is a good reason to seek those specific numbers in filenames, this is also a good way to document any such fixed factors.
I guess you need something like this:
my $path = '/home/user/RunBackup/';
my @files = map { $path . "*$_*" } @events;
system(join " ", "mv", @files, "/home/user/RunBackup/files/");
If there are lots of files you might need to move them one by one:
system(join " ", "mv", $_, "/home/user/RunBackup/files/") for @files;

"No Such File Error" when trying to open each fasta file stored in an array

How can I open each file in a folder in sequential order, perform a regex search on the contents of each file, and store the matches in another array?
Here is what I have so far:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $dir = ("/path/to/folder");
my @ArrayofFiles;
my @TrimmedSequences;
opendir( my $dh, $dir ) || die;
#make an array of fasta files from a folder
while ( readdir $dh ) {
    chomp;
    my $fileName = $_;
    if ($fileName =~ /\.fasta.*/) {
        push(@ArrayofFiles, $fileName);
    }
}
#this diagnostic print statement shows that I do get the proper files into the target array. I leave it commented out when I run the script.
#print join("\n", @ArrayofFiles), "\n";
#now I want to open each file in the array, search file contents, and add the result to another array
foreach my $file (@ArrayofFiles){
    open (my $sequence, '<', $file) or die $!;
    while (my $line = <$sequence>) {
        if ($line =~ m/(CTCCCA)[TAGC]+(TCAGGA)/) {
            push(@TrimmedSequences, $line);
        }
    }
}
When I run this code, I get the following error message:
"Uncaught exception from user code: No such file or directory at /Users/roblogan/Documents/BIOL6309/Manipulating fast5 files/Attempt 5 line 23."
Line 24 is "open (my $sequence, '<', $file) or die $!;"
My diagnostic print statement shows that I am working with an array of the expected fasta files.
I would be very grateful for any help I can get. Thank you so much.
-Rob
@ArrayofFiles contains only the filenames; it doesn't include the directory prefix. So you're trying to open the files in the current directory rather than in the directory you listed.
Use:
push(@ArrayofFiles, "$dir/$fileName");
to get the full path.
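The same fix can be packaged so the directory is prepended in one place. In this sketch the helper name fasta_paths is hypothetical, File::Spec->catfile is used instead of string interpolation to join the parts, and the filter is tightened to names ending in ".fasta", which is an assumption about the intended files.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full paths of the .fasta files in $dir.
sub fasta_paths {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @paths = map  { File::Spec->catfile($dir, $_) }   # prepend the directory
                grep { /\.fasta$/ }                      # keep only names ending in .fasta
                readdir $dh;
    closedir $dh;
    return sort @paths;
}

# Usage: pass the folder on the command line, defaulting to the current directory.
my $dir = shift(@ARGV) // '.';
print "$_\n" for fasta_paths($dir);
```

Every path returned this way can be handed straight to open, regardless of the script's current working directory.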
