uniquely rename each of many files using perl - arrays

I have a folder containing 96 files that I want to rename. The problem is that each file name needs a unique change, not something uniform like adding a zero to the front of each name or changing extensions, so a search and replace isn't practical.
Here's a sample of the names I want to change:
newSEACODI-sww2320H-sww24_07A_CP.9_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07B_CP.10_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07C_CP.11_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07D_CP.12_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07E_R.1_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07F_R.3_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07G_R.4_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07H_R.5_sww2320H_sww2403F.fsa
I'd like to use perl to change the above names to the below names, respectively:
SEACODI_07A_A.2_sww2320H_2403F.fsa
SEACODI_07B_A.4_sww2320H_2403F.fsa
SEACODI_07C_H.1_sww2320H_2403F.fsa
SEACODI_07D_H.3_sww2320H_2403F.fsa
SEACODI_07E_H.6_sww2320H_2403F.fsa
SEACODI_07F_H.7_sww2320H_2403F.fsa
SEACODI_07G_Rb.4_sww2320H_2403F.fsa
SEACODI_07H_Rb.9_sww2320H_2403F.fsa
Can such a thing be done? I have a vague idea that I might make a text file with a list of the new names and call that list @newnames. I would make another array out of the current file names, and call it @oldnames. I'd then do some kind of for loop where each element $i in @oldnames is replaced by the corresponding $i in @newnames.
I don't know how to make an array out of my current file names, though, and so I'm not sure if this vague idea is on the right track. I keep my files with the messed-up names in a directory called 'oldnames'. The below is my attempt to make an array out of the file names in that directory:
#!/usr/bin/perl -w
use strict; use warnings;
my $dir = 'oldnames';
opendir ('oldnames', $dir) or die "cannot open dir $dir: $!";
my @file = readdir 'oldnames';
closedir 'oldnames';
print "@file\n";
The above didn't seem to do anything. I'm lost. Help?

Here:
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use File::Copy;
# capture script name, in case we are running the script from the
# same directory we are working on.
my $this_file = (split(/\//, $0))[-1];
print "skipping file: $this_file\n";
my $oldnames = "/some/path/to/oldnames";
my $newnames = "/some/path/to/newnames";
# open the directory
opendir(my $dh, $oldnames);
# grep out all directories and possibly this script.
# (-d is tested against the full path, since we have not chdir'ed yet.)
my @files_to_rename = grep { !-d "$oldnames/$_" && $_ ne $this_file } readdir $dh;
closedir $dh;
### UPDATED ###
# create hash of file names from lists:
my @old_filenames = qw(file1 file2 file3 file4);
my @new_filenames = qw(onefile twofile threefile fourfile);
my $filenames = create_hash_of_filenames(\@old_filenames, \@new_filenames);
my @missing_new_file = ();
# change directory, so we don't have to worry about pathing
# of files to rename and move...
chdir($oldnames);
mkdir($newnames) if !-e $newnames;
### UPDATED ###
for my $file (@files_to_rename) {
    # Check that current file exists in the hash,
    # if true, copy old file to new location with new name
    if( exists($filenames->{$file}) ) {
        copy($file, "$newnames/$filenames->{$file}");
    } else {
        push @missing_new_file, $file;
    }
}
if( @missing_new_file ) {
    print "Could not map files:\n",
        join("\n", @missing_new_file), "\n";
}
# create_hash_of_filenames: creates a hash, where
# key = oldname, value = newname
# input: two array refs
# output: hash ref
sub create_hash_of_filenames {
    my ($oldnames, $newnames) = @_;
    my %filenames = ();
    for my $i ( 0 .. (scalar(@$oldnames) - 1) ) {
        $filenames{$$oldnames[$i]} = $$newnames[$i];
    }
    # see Dumper output below, to see data structure
    return \%filenames;
}
Dumper result:
$VAR1 = {
'file2' => 'twofile',
'file1' => 'onefile',
'file4' => 'fourfile',
'file3' => 'threefile'
};
Running script:
$ ./test.pl
skipping file: test.pl
Could not map files:
a_file.txt
b_file.txt
c_file.txt
File result:
$ ls oldnames/
a_file.txt
b_file.txt
c_file.txt
file1
file2
file3
file4
$ ls newnames/
fourfile
onefile
threefile
twofile
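As a side note, the old-to-new mapping that create_hash_of_filenames builds can also be written with a hash slice. The following is a minimal sketch, assuming the two lists have the same length and are in matching order; the sample names come from the question, and %new_name_for is a name chosen here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Old and new names, in matching order (two samples from the question).
my @old_filenames = qw(
    newSEACODI-sww2320H-sww24_07A_CP.9_sww2320H_sww2403F.fsa
    newSEACODI-sww2320H-sww24_07B_CP.10_sww2320H_sww2403F.fsa
);
my @new_filenames = qw(
    SEACODI_07A_A.2_sww2320H_2403F.fsa
    SEACODI_07B_A.4_sww2320H_2403F.fsa
);

# Refuse to pair lists of different lengths.
die "old and new lists differ in length\n"
    unless @old_filenames == @new_filenames;

# Hash slice: assigns each new name to the key at the same position.
my %new_name_for;
@new_name_for{@old_filenames} = @new_filenames;

print "$_ => $new_name_for{$_}\n" for sort keys %new_name_for;
```

The length check matters because a slice silently assigns undef to any key without a partner, which would otherwise surface later as a confusing copy error.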

Your code is a little odd, but it should work. Are you running it in the directory "oldnames" or in the directory above it? You should be in the directory above it. A more standard way of writing it would be like this:
#!/usr/bin/perl -w
use strict; use warnings;
my $dir = 'oldnames';
opendir ( my $oldnames, $dir) or die "cannot open dir $dir: $!";
my @file = readdir $oldnames;
closedir $oldnames;
print "@file\n";
This would populate @file with all the entries in oldnames, including '.' and '..'. You might need to filter those out depending on how you do your renaming.
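Filtering out '.' and '..' is usually a one-line grep. A minimal sketch follows; it creates a throwaway directory with two empty files (a.fsa and b.fsa, names invented here) so that it runs anywhere, whereas in practice $dir would simply be 'oldnames'.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory with two sample files, standing in for 'oldnames'.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.fsa b.fsa)) {
    open my $fh, '>', "$dir/$name" or die "cannot create $dir/$name: $!";
    close $fh;
}

opendir( my $dh, $dir ) or die "cannot open dir $dir: $!";
# Drop '.' and '..', then keep only plain files.
my @file = sort grep { !/^\.\.?\z/ && -f "$dir/$_" } readdir $dh;
closedir $dh;

print "@file\n";
```

Note that readdir returns bare names, so the -f test has to be given the directory prefix unless the script has already changed into that directory.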

Can you do this with rename? It does allow you to use perl code and expressions as arguments if I recall.
The real answer is the one by @chrsblck: it does some checks and doesn't make a mess.
For comparison, here is a messy one-liner that may suffice. It relies on you providing a list of new file names in the same order as the sorted old file names. Perhaps for your situation (where you don't want to do any programmatic transformation of the file names) you could just use a shell loop (see the end of this post) that reads the old and new names from files. A better Perl solution would be to put both lists of file names into two columns of one file, parse that file using the -a switch and @F, and then use File::Copy to copy the files around.
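The two-column idea can be sketched roughly as follows. This is an illustration under assumptions, not a tested tool: the mapping is held in a here-document instead of a real file, the directories are scratch directories created on the spot, and empty stand-in files are created so the sketch runs anywhere. The split on ' ' is the same split that the -a switch performs to fill @F.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Scratch directories standing in for oldfiles/ and newfiles/.
my $old_dir = tempdir( CLEANUP => 1 );
my $new_dir = tempdir( CLEANUP => 1 );

# Two-column list: old name, whitespace, new name (names from the question).
my $mapping = <<'END';
newSEACODI-sww2320H-sww24_07A_CP.9_sww2320H_sww2403F.fsa SEACODI_07A_A.2_sww2320H_2403F.fsa
newSEACODI-sww2320H-sww24_07B_CP.10_sww2320H_sww2403F.fsa SEACODI_07B_A.4_sww2320H_2403F.fsa
END

my @copied;
for my $line ( split /\n/, $mapping ) {
    my ( $old, $new ) = split ' ', $line;    # same split that -a uses for @F
    next unless defined $new;                # skip blank or malformed lines

    # Create an empty stand-in for the old file so the sketch runs anywhere.
    open my $fh, '>', "$old_dir/$old" or die "cannot create $old: $!";
    close $fh;

    copy( "$old_dir/$old", "$new_dir/$new" )
        or die "cannot copy $old to $new: $!";
    push @copied, $new;
}
print "copied: $_\n" for @copied;
```

Keeping each old name next to its new name on one line removes the dependence on sort order that the one-liner approach has.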
Anyway, below are some suggestions.
First, set things up:
% vim newfilenames.txt # list new names one per line corresponding to old names.
% wc -l newfilenames.txt # the same number of new names as files in ./oldfiles/
8 newfilenames.txt
% ls -1 oldfiles # 8 files rename these in order to list from newfilenames.txt
newSEACODI-sww2320H-sww24_07A_CP.9_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07B_CP.10_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07C_CP.11_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07D_CP.12_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07E_R.1_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07F_R.3_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07G_R.4_sww2320H_sww2403F.fsa
newSEACODI-sww2320H-sww24_07H_R.5_sww2320H_sww2403F.fsa
With files arranged as above, copy everything over:
perl -MFile::Copy -E 'opendir($dh , oldfiles); @newfiles=`cat newfilenames.txt`; chomp @newfiles; @oldfiles = sort grep(/^.+\..+$/, readdir $dh); END {for $i (0..$#oldfiles){copy("oldfiles/$oldfiles[$i]", "newfiles/$newfiles[$i]"); }}'
Not pretty: you have to grep and sort on @oldfiles to get rid of . and .. and put the array elements in order. And there's always the risk that a typo could make a mess that would be hard to untangle.
If you put the old and new names in a couple of files, you could just do this with a shell loop that reads both lists side by side:
paste ../oldfilenames.txt ../newfilenames.txt | while read -r old new; do cp "$old" "$new"; done
or just cd into the directory with the old files and do:
mkdir new
ls -p | grep -v / | paste - ../newfilenames.txt | while read -r old new; do cp "$old" "new/$new"; done
Good luck!

Related

Parsing unique data and renaming files

I was trying to create a Perl script to rename the files (hundreds of files with different names), but I have not had any success. I first need to find the unique file number and then rename it to something more human readable. Since file names are not sequential, it makes it difficult.
Examples of file names. The number of importance is after the sequence:
# vv-- this number
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq
lane8-s250-index--TCCGGAGA-AGGCGAAG-13_S250_L008_R1_001.fastq
lane8-s251-index--TCCGGAGA-TAATCTTA-14_S251_L008_R1_001.fastq
lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq
lane7-s0008-index--ATTACTCG-ATAGAGGC-105_S8_L007_R1_001.fastq
lane7-s0009-index--ATTACTCG-CCTATCCT-195_S9_L007_R1_001.fastq
lane7-s0010-index--ATTACTCG-GGCTCTGA-106_S10_L007_R1_001.fastq
lane7-s0011-index--ATTACTCG-AGGCGAAG-197_S11_L007_R1_001.fastq
lane7-s0096-index--AGCGATAG-CAGGACGT-287_S96_L007_R1_001.fastq
I have created a file called RENAMING_parse_data.sh that reference RENAMING_parse_data.pl
So in theory the idea is that it is parsing the data to find the sample # that is in the middle of the name, and taking that unique ID and renaming it. But I don't think it's even going into the IF loop.
Any ideas?
HERE IS THE .sh file that calls the Perl script:
#!/bin/bash
#first part is the program
#second is the directory path
#third and fourth times are the names of the output files
#./parse_data.pl /ACTF/Course/PATHTDIRECTORY Tabsummary.txt Strucsummary.txt
#WHERE ./parse_data.pl =name of the program
#WHERE /ACTF/Course/PATHTODIRECTORY = directory path where your files are saved AND is referred to as $dir_in = $ARGV[0] in the perl script;
#new file you are creating with the extracted data AND is referred to as $indv_list = $ARGV[1];
./RENAMING_parse_data.pl ./Test/ FishList.txt
HERE IS THE PERL SCRIPT:
#!/usr/bin/perl
print (":)\n");
#Proesessing files in a directory
$dir_in = $ARGV[0];
$indv_list = $ARGV[1];
#open directory to acess those files, the folder where you have the files
opendir(DIR, $dir_in) || die ("Cannot open $dir_in");
@files = readdir(DIR);
#set all variables = 0 to void chaos
$j=0;
#open output header line for output file and print header line for tab delimited file
open(OUTFILETAB, ">", $indv_list);
print(OUTFILETAB "\t Fish ID", "\t");
#open each file
foreach (@files){
#re start all arrays to void chaos
print("in loop [$j]");
@acc_ID=();
#find FISH name
#EXAMPLE FISH NAMES: (lenth of fishname varies)
#lane8-s251-index--TCCGGAGA-TAATCTTA-14_S251_L008_R1_001.fastq.gz
#lane7-s0096-index--AGCGATAG-CAGGACGT-287_S96_L007_R1_001.final.fastq
#NOTE: what is in btween () is the ID that is printed NOTE that value can change from 2 -3 depending on Sample #
#Trials:
#lane[0-9]{1}-[a-z]{1}[0-9]{4}-index--[A-Z]{8}[A-Z]{8}-([0-9]{3})[a-z]{1}[0-9]{2}_[A-Z]{1}[0-9]{3}_[a-z]{1}[0-9]{1}_[0-9]{3}.fastq
#lane[0-9]{1}-[a-z]{1}[0-9]{4}-index--[A-Z]{8}[A-Z]{8}-([0-9]{3})*.fastq
#lane*([0-9]{3})*.fastq
#lane.*-([0-9]{2})_.*.fastq
#lane.*-([0-9]{2})_*.fastq
#lane[0-9]{1}-[a-z]{1}[0-9]{3}-index--[A-Z]{8}[A-Z]{8}-([0-9]{2})_[A-Z]{1}[0-9]{3}_L008_R1_001.fastq
$string_FISH = @files;
if ($string_FISH =~ /^lane[0-9]{1}-[a-z]{1}[0-9]{3}-index--[A-Z]{8}[A-Z]{8}-([0-9]{2})_[A-Z]{1}[0-9]{3}_L008_R1_001.fastq/){
$FISH_ID =$1;
@acc_ID[$j] = $FISH_ID;
#print ("FISH. = |$FISH_ID[$j]| \n");
rename($string_FISH, "FISH. = |$FISH_ID[$j]|");
#print ($acc_ID[$j], "\n");
print(OUTFILETAB "FISH. = |$FISH_ID[$j]| \n");
}
$j= $j+1;
}
IDEAL END RESULT
So in the end I would like it to take the file name, find the unique identifier and rename it
from :
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq
to:
Fish.01.fastq
Fish.193.fastq
Any ideas or suggestions on how to fix this, or on whether it needs to change completely, are greatly appreciated.
At the core of a Perl solution, you could use
s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa
For example,
$ ls -1 *.fastq
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq
$ rename 's/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa' *.fastq
$ ls -1 *.fastq
Fish.01.fastq
Fish.02.fastq
Fish.09.fastq
Fish.10.fastq
Fish.11.fastq
Fish.12.fastq
(There are two similar tools named rename. This one is also known as prename.)
It's pretty simple to implement yourself:
#!/usr/bin/perl
use strict;
use warnings;
my $errors = 0;
for (@ARGV) {
    my $old = $_;
    s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sa;
    my $new = $_;
    next if $new eq $old;
    if ( -e $new ) {
        warn( "Can't rename \"$old\" to \"$new\": Already exists\n" );
        ++$errors;
    }
    elsif ( !rename( $old, $new ) ) {
        warn( "Can't rename \"$old\" to \"$new\": $!\n" );
        ++$errors;
    }
}
exit( !!$errors );
Provide the files to rename as arguments (e.g. using *.fastq from the shell).
$ ls -1 *.fastq
lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
lane8-s245-index--ATTACTCG-ATAGAGGC-02_S245_L008_R1_001.fastq
lane8-s246-index--TCCGGAGA-TATAGCCT-09_S246_L008_R1_001.fastq
lane8-s247-index--TCCGGAGA-ATAGAGGC-10_S247_L008_R1_001.fastq
lane8-s248-index--TCCGGAGA-CCTATCCT-11_S248_L008_R1_001.fastq
lane8-s249-index--TCCGGAGA-GGCTCTGA-12_S249_L008_R1_001.fastq
$ ./a *.fastq
$ ls -1 *.fastq
Fish.01.fastq
Fish.02.fastq
Fish.09.fastq
Fish.10.fastq
Fish.11.fastq
Fish.12.fastq
The existence check (-e) is to prevent accidentally renaming a bunch of files to the same name and therefore losing all but one of them.
The above is a cleaned-up version of a one-liner pattern I often use.
dir /b ... | perl -nle"$o=$_; s/.../.../; $n=$_; rename$o,$n if!-e$n"
Adapted to sh:
\ls ... | perl -nle'$o=$_; s/.../.../; $n=$_; rename$o,$n if!-e$n'
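The -e check only notices a clash once an earlier rename has already taken the target name. A pre-flight pass can apply the same substitution to the names in memory and report clashes before anything on disk changes. The following is a minimal sketch of that idea; the third sample name is made up here to force a clash, and %sources_for is an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First two names are from the question; the third is invented to collide.
my @names = qw(
    lane8-s244-index--ATTACTCG-TATAGCCT-01_S244_L008_R1_001.fastq
    lane7-s0007-index--ATTACTCG-TATAGCCT-193_S7_L007_R1_001.fastq
    lane7-s0001-index--ATTACTCG-TATAGCCT-01_S1_L007_R1_001.fastq
);

# Map each proposed new name to the old names that would produce it.
# The /r flag returns the changed copy and leaves $old untouched.
my %sources_for;
for my $old (@names) {
    my $new = $old =~ s/^.*-(\d+)_[^-]+(?=\.fastq\z)/Fish.$1/sar;
    push @{ $sources_for{$new} }, $old;
}

# A new name with more than one source is a collision.
my @collisions = sort grep { @{ $sources_for{$_} } > 1 } keys %sources_for;

for my $new ( sort keys %sources_for ) {
    my $status = @{ $sources_for{$new} } > 1 ? 'COLLISION' : 'ok';
    print "$new <- @{ $sources_for{$new} } [$status]\n";
}
```

This matters for the question's data, since the lane7 and lane8 files could plausibly share a sample number.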

Creating array of file names using grep

I'm having difficulty outputting file names as an array using grep. Specifically, I want to create an array of file names (plant photos) formatted like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Here's my original Perl script that I'm attempting to modify. It returns, as intended, ONE file name per plant as "Genus_species", printing a list of those plants:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
    map { s/1\.png\z/.png/r } # Removes "1" from end of file names
    grep { /^[^2-9]*\.png\z/i && /_/ } # Finds "Genus_species.png" & "Genus_species1.png" and returns one file name per plant as "Genus_species.png"
    readdir $dfh;
foreach my $file (@files) {
    $file =~ s/\.png//; # Removes ".png" extension
    print "$file\n"; # Prints list of file names (plant names)
}
Here's the output:
Ilex_verticillata
Asarum_canadense
Ageratina_altissima
However, since each plant often has MULTIPLE photos (e.g.-- "Genus_species1.png, Genus_species2.png, etc.), I need to re-grep the directory using the above output to find their file names, then output the results in the form of an array as previously illustrated.
I know the solution likely involves modifying the "foreach" statement, using grep to return ALL file names with "Genus_species" in their name. Here's what I tried:
foreach my $file (@files) {
$file =~s/\.png//;
grep ($file,readdir(DIR));
print "$file = $file\n";
But the output was this:
Ilex_verticillata = Ilex_verticillata
Asarum_canadense = Asarum_canadense
Ageratina_altissima = Ageratina_altissima
Again, I want to output an array formatted as:
"Genus_species= Genus_species1.png, Genus_species2.png, etc.," meaning I want it to look like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Notice that I also want to add back the ".png" extension ONLY to the file names to the right of the equals sign.
Please advise. Thanks.
Readdir returns a list of files in the folder. You've put them on one line, which is compact. However, if you loop them you can process the items further.
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\d\.png$/; ## skip entry if not a png file ending with a number
    my $genus = $file =~ s/\d\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## push to array; the @{} dereferences the hash entry as an array
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through entries in the array reference
    print "\n";
}
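To get the exact "Genus_species= a.png, b.png" layout asked for in the question, the grouped names can be joined with a comma. A minimal sketch, assuming the directory listing is replaced by a hard-coded list of names from the question so that it runs without the image folder; %photos_for is an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the readdir output (file names taken from the question).
my @dir_entries = qw(
    . ..
    Ilex_verticillata1.png Ilex_verticillata2.png Ilex_verticillata3.png
    Asarum_canadense1.png Asarum_canadense2.png
);

my %photos_for;    # "Genus_species" => [ photo file names ]
for my $file (@dir_entries) {
    # Capture everything before the trailing photo number as the plant name.
    next unless $file =~ /\A(.+?)\d+\.png\z/i;
    push @{ $photos_for{$1} }, $file;
}

my @lines;
for my $plant ( sort keys %photos_for ) {
    push @lines, "$plant= " . join( ', ', sort @{ $photos_for{$plant} } );
}
print "$_\n" for @lines;
```

The plain sort is alphabetical, so a plant with ten or more photos would list photo 10 before photo 2; a numeric sort on the captured number would be needed in that case.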

Perl combine multiple file contents to single file

I have multiple log files say file1.log file2.log file3.log etc. I want to combine these files contents and put it into single file called result_file.log
Is there any Perl module which can achieve this?
Update: Here is my code
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use File::Copy;
my @files;
my $dir = "/path/to/directory";
opendir(DIR, $dir) or die $!;
while (my $file = readdir(DIR)) {
# We only want files
next unless (-f "$dir/$file");
# Use a regular expression to find files ending in .log
next unless ($file =~ m/\.log$/);
print "$file\n";
push( @files, $file);
}
closedir(DIR);
print Dumper(\@files);
open my $out_file, ">result_file.log" ;
copy($_, $out_file) foreach ( @files );
exit 0;
Do you think it is feasible solution?
The core module File::Copy should do the work; you will have to open the output file yourself.
use File::Copy ;
open my $out, ">result.log" ;
copy($_, $out) foreach ('file1.log', 'file2.log', );
close $out ;
Update 1:
Based on the additional information added to the question, it looks like the goal is to concatenate (in Perl) a list of files matching a pattern (*.log). The code below extends the above solution with glob, which avoids the readdir and the filtering.
use File::Copy ;
open my $out, ">result.log" ;
copy($_, $out) foreach glob('/path/to/dir/*.log' );
close $out ;
Important notes:
* glob returns the file names sorted alphabetically, while readdir does NOT guarantee any order.
* The output file 'result.log' itself matches '*.log', so do not run this code with the current directory as the source directory.
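One way to guard against the second point is to filter the output file out of the glob result before concatenating. The sketch below is only an illustration: it builds a scratch directory with two small logs (contents invented here) and uses an explicit read/print loop rather than File::Copy, so the whole flow is visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with two small logs, standing in for /path/to/dir.
my $dir = tempdir( CLEANUP => 1 );
my %content = ( 'file1.log' => "one\n", 'file2.log' => "two\n" );
for my $name ( keys %content ) {
    open my $fh, '>', "$dir/$name" or die "cannot write $name: $!";
    print {$fh} $content{$name};
    close $fh or die "cannot close $name: $!";
}

my $result = "$dir/result_file.log";

# glob returns sorted names; drop the output file in case it already exists.
my @logs = grep { $_ ne $result } glob("$dir/*.log");

open my $out, '>', $result or die "cannot write $result: $!";
for my $log (@logs) {
    open my $in, '<', $log or die "cannot read $log: $!";
    print {$out} $_ while <$in>;    # append this log to the output
    close $in;
}
close $out or die "cannot close $result: $!";
```

The list of inputs is fixed before the output file is opened, so the result can safely live in the same directory as the logs.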
Do you think it is feasible solution?
I'm afraid not. Your code is the equivalent of typing these commands at your prompt:
$ cp file1.log result_file.log
$ cp file2.log result_file.log
$ cp file3.log result_file.log
$ ... etc ...
The problem with this is that it copies each file, in turn over the top of the previous one. So you end up with a copy of the final file in the list.
As I said in a comment, this is most easily done using cat - no need for Perl at all.
$ cat file1.log file2.log file3.log > result_file.log
If you really want to do it in Perl, then something like this would work (the first section is rather similar to yours).
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @files;
my $dir = "/path/to/directory";
opendir(my $dh, $dir) or die $!;
while (my $file = readdir($dh)) {
# We only want files
next unless (-f "$dir/$file");
# Use a regular expression to find files ending in .log
next unless ($file =~ m/\.log$/);
print "$file\n";
push( @files, "$dir/$file");
}
closedir($dh);
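print Dumper(\@files);
open my $out_file, '>', 'result_file.log' or die "result_file.log: $!";
foreach my $fn (@files) {
    open my $in_file, '<', $fn or die "$fn: $!";
    # read from the input handle (not the file name) and append each line
    print $out_file $_ while <$in_file>;
}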

Regexp to Compare partial filenames then moving to another directory perl

I am working on a script to compare non-running files within a dir to running files from a command. I have to use Regex to strip the front half of the filenames from the dir then regex to strip the filenames out of a command which then records the unmatched names into an array.
The part I cannot figure out is how I can move the filenames from the old dir into a new directory for future deletion.
In order to move the files I will need to enclose them in wildcards (*) because of the random numbers in front of the filenames and the extension.
example filenames before and after:
within dir:
13209811124300209156562070_cake_872_trucks.rts
within command:
{"file 872","cake_872_trucks.rts",running}
in @events array:
cake_872_trucks
My code:
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Copy qw(move);
use Data::Dumper;
use List::Util 'max';
my $orig_dir = "/var/user/data/";
my $dest_dir = "/var/user/data/DeleteMe/";
my $dir = "/var/user/data";
opendir(DIR, $dir) or die "Could not open $dir: $!\n";
my @allfiles = readdir DIR;
close DIR;
my %files;
foreach my $allfiles(@allfiles) {
$allfiles =~ m/^(13{2}638752056463{2}635181_|1[0-9]{22}_|1[0-9]{23}_|1[0-9]{24}_|1[0-9]{25}_)([0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z][0-9a-z]{3}_[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z]{2}0[0-9]\.rts|[a-z]{2}[0-9a-z][0-9]\N[0-9a-z]\N[0-9]\N[0-9]\N[0-9a-z]{4}\N[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts|S0{2}2_86F_JATD_01ZF\.rts)$/im;
$files{$2} = [$1];
}
my @stripfiles = keys %files;
my $cmd = "*****";
my @runEvents = `$cmd`;
chomp @runEvents;
foreach my $running(@runEvents) {
$running =~ s/^\{"blah 8[0-9a-z]{2}","(?<field2>CBE1_D{3}1_8EC_J6TG0{2}\.rts|[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts)(?:",\{239,20,93,5\},310{2},20{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{2}0{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{3}0{4},run{2}ing\}|",\{239,20,93,5\},3[0-9]0{2},[0-9]{2}0{4},run{2}ing\})$/$+{field2}/img;
}
my @events = grep {my $x = $_; not grep {$x =~/\Q$_/i}@runEvents}@stripfiles;
foreach my $name (@events) {
my ($randnum, $fnames) = { $files{$name}};
my $combined = $randnum . $fnames;
print "Move $file from $orig_dir to $dest_dir";
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
#print scalar(grep $_, @stripfiles), "\n";
#returned 1626
#print scalar(grep $_, @runEvents), "\n";
#returned 102
#print scalar(grep $_, @allfiles), "\n";
#returned 1906
Once you are parsing filenames with regex there is no reason not to be able to capture all parts so that you can later reconstitute needed parts of the filename.
I assume that that overly long (and incomplete) regex does what it is meant to.
I am not sure how the files to move relate to the original files in @allfiles, since those are fetched from /var/user/data while your moving attempt uses /home/user/RunBackup. So code snippets below are more generic.
If what gets moved are precisely the files from @allfiles then just keep the file name
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/...(...).../; # your regex, but capture the name
$files{$1} = $oldfile;
}
where by /...(...).../ I mean to indicate that you use your regex, but add parentheses around the part of the pattern that matches the name itself.
Then you can later retrieve the filename from the "name" of interest (cake_872_trucks).
If, however, the filename components may be needed to patch a different (while related) filename then capture and store the individual components
my %files;
foreach my $oldfile (@allfiles) {
$oldfile =~ m/(...)(...)(...)/; # your regex, just with capture groups
$files{$2} = [$1, $3]; # add to %files: name => [number, ext]
}
The regex only matches (why change names in @allfiles with s///?), and captures.
The first set of parentheses captures that long leading factor (number) into $1, the second one gets the name (cake_872_trucks) into $2, and the third one has the extension, in $3.
So you end up with a hash with keys that are names of interest, with their values being arrayrefs with all other needed components of the filename. Please adjust as needed as I don't know what that regex does and may have missed some parts.
Now once you go through @events you can rebuild the name
use feature qw(say);
use File::Copy qw(move);
foreach my $name (@events) {
    my ($num, $ext) = @{ $files{$name} };
    my $file = $num . $name . $ext;
    say "Move $file from $orig_dir to $dest_dir";
    move("$orig_dir/$file", $dest_dir) or warn "Can't move $file: $!";
}
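Put together, the capture-and-rebuild idea might look like the sketch below. The pattern is a deliberately simplified stand-in for the question's long regex (digits and an underscore, then the name, then the extension), the second file name is invented for illustration, and the names are only collected in @to_move rather than actually moved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Directory listing and names of interest, modeled on the question's example.
# The second file name is hypothetical.
my @allfiles = qw(
    13209811124300209156562070_cake_872_trucks.rts
    13209811124300209156562071_pie_873_vans.rts
);
my @events = ('cake_872_trucks');

# Simplified stand-in pattern: number prefix, name, extension.
my %files;
for my $oldfile (@allfiles) {
    next unless $oldfile =~ m/\A(\d+_)(.+)(\.rts)\z/;
    $files{$2} = [ $1, $3 ];    # name => [number prefix, extension]
}

my @to_move;
for my $name (@events) {
    next unless exists $files{$name};    # no matching file in the listing
    my ( $num, $ext ) = @{ $files{$name} };
    push @to_move, $num . $name . $ext;  # rebuild the full file name
}
print "would move: $_\n" for @to_move;
```

Skipping names that are missing from %files avoids dereferencing an undefined value when an event has no corresponding file.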
But if the files to move are indeed from @allfiles (as would be the case in this example) then use the first version above to store filenames as values in %files and now retrieve them
foreach my $name (@events) {
    my $file = $files{$name};
    move("$orig_dir/$file", $dest_dir)
        or warn "Can't move $file: $!";
}
I use the core module File::Copy, instead of going out to the system for the move command.
You can also rebuild the name by going through the directory again, now with names of interest on hand. But that'd be very expensive since you have to try to match every name in @events for every file read in the directory (O(mn) complexity).
What you asked about can be accomplished with glob (and note File::Glob's version)
my #files = glob "$dir/*${name}*";
but you'd have to do this for every $name -- a huge and needless waste of resources.
If that regex really must spell out specific numbers, here is a way to organize it for easier digestion (and debugging!): break it into reasonable parts, with a separate variable for each.
Ideally each part of alternation would be one variable
my $p1 = qr/.../;
my $p2 = qr/.../;
...
my $re_alt = join '|', $p1, $p2, ...;
my $re_other = qr/.../;
$var =~ m/^($re_alt)($re_other)(.*)$/; # adjust anchors, captures, etc
where the qr operator builds a regex pattern.
Adjust those capturing parentheses, anchors, etc. to your actual needs. Breaking it up so that the regex is sensibly split into variables will go a long way for readability, and thus correctness.
Assuming that there is a good reason to seek those specific numbers in filenames, this is also a good way to document any such fixed factors.
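As a concrete illustration of splitting a pattern into named parts, here is a minimal sketch. The sub-patterns are simplified placeholders rather than the question's real alternations, and the variable names ($re_name, $re_ext, $re_whole) are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each piece gets a name; these patterns are simplified placeholders.
my $p1       = qr/1[0-9]{22}_/;       # 23-digit prefix plus underscore
my $p2       = qr/1[0-9]{25}_/;       # 26-digit prefix plus underscore
my $re_alt   = join '|', $p1, $p2;    # alternation of the prefixes
my $re_name  = qr/[a-z]+_8[0-9a-z]{2}_[a-z]+/i;
my $re_ext   = qr/\.rts/;
my $re_whole = qr/\A($re_alt)($re_name)($re_ext)\z/;

my $file = '13209811124300209156562070_cake_872_trucks.rts';
my ( $num, $name, $ext ) = $file =~ $re_whole
    or die "no match for $file\n";
print "number=$num name=$name ext=$ext\n";
```

Each qr// object can also be tested on its own against sample input, which makes it much easier to find the part of a long pattern that fails.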
I guess you need something like this:
my $path = '/home/user/RunBackup/';
my @files = map {$path."*$_*"} @events;
system(join " ", "mv", @files, "/home/user/RunBackup/files/");
If there are lots of files you might need to move them one by one:
system(join " ", "mv", $_, "/home/user/RunBackup/files/") for @files;

Call upon specific elements from array

Ok, so I have a bunch of file names possessing one of the following two formats:
Sample-ID_Adapter-Sequence_L001_R1_001.fastq (As Forward)
Sample-ID_Adapter-Sequence_L001_R2_001.fastq (As Reverse)
The only difference between the forward and reverse formats is the R1 and R2 elements in the filename. Now, I've managed to enable the user to provide the directory containing these files with the following script:
#!/usr/bin/perl
use strict;
use warnings;
#Print Directory
print "Please provide the directory containing the FASTQ files from your Illumina MiSeq run \n";
my $FASTQ = <STDIN>;
chomp ($FASTQ);
#Open Directory
my $dir = $FASTQ;
opendir(DIR, $dir) or die "Cannot open $dir: $!";
my @forwardreads = grep { /R1_001.fastq/ } readdir DIR;
closedir DIR;
my $direct = $FASTQ;
opendir(DIR, $direct) or die "Cannot open $dir: $!";
my @reversereads = grep { /R2_001.fastq/ } readdir DIR;
closedir DIR;
foreach my $ffile (@forwardreads) {
my $forward = $ffile;
print $forward;
}
foreach my $rfile (@reversereads) {
my $reverse = $rfile;
print $reverse;
}
The Problem
What I want to do with the above script is to find a way to pair up the elements of both arrays that are derived from the same Sample ID. Like I said, the only difference between the forward and reverse files (from the same sample ID) would be the R1 and the R2 parts of the file name.
I've tried looking up ways to extract elements from an array, but I want to let the program do the matching instead of me.
Thanks for reading and I hope you guys can help!
You'll have to parse out the filename. Fortunately, this is pretty straightforward. After stripping the extension, you can split the pieces on _.
# Strip the file extension. s/// returns a success count rather than
# the capture, so read the extension from $1.
my $suffix = $filename =~ s{\.([^.]*)$}{} ? $1 : '';
# Parse Sample-ID_Adapter-Sequence_L001_R1_001
my($sample_id, $adapter_sequence, $uhh, $format, $yeah) = split /_/, $filename;
Now you can do what you like with them.
I'd suggest a few things to improve the code. First, put that filename parsing into a function so it can be reused and to keep the main code simpler. Second, parse the filenames into a hash rather than a bunch of scalars, it'll be easier to work with and pass around. Finally, include the filename itself in that hash, then the hash contains the complete data. This, btw, is a gateway drug to OO programming.
sub parse_fastq_filename {
    # Read the next (in this case first and only) argument.
    my $filename = shift;
    # Strip the suffix from a copy, so the original filename survives.
    # s/// returns a success count, not the capture, so read it from $1.
    my $stem = $filename;
    my $suffix = $stem =~ s{\.([^.]*)$}{} ? $1 : '';
    # Parse Sample-ID_Adapter-Sequence_L001_R1_001
    my($sample_id, $adapter_sequence, $uhh, $format, $yeah) = split /_/, $stem;
    return {
        filename => $filename,
        sample_id => $sample_id,
        adapter_sequence => $adapter_sequence,
        uhh => $uhh,
        format => $format,
        yeah => $yeah
    };
}
Then rather than finding left and right formatted files separately, process everything in one loop. Put matching left and right pairs in a hash. Use glob to pick up only the .fastq files.
# This is where the pairs of files will be stored.
my %pairs;
# List just the *.fastq files
while( my $filename = glob("$FASTQ_DIR/*.fastq")) {
# Parse the filename into a hash reference
my $fastq = parse_fastq_filename($filename);
# Put each parsed fastq filename into its pair
$pairs{ $fastq->{sample_id} }{ $fastq->{format} } = $fastq;
}
Then you can do what you like with %pairs. Here's an example to print out each sample ID and what formats it has.
# Iterate through each sample and pair.
# $sample is a hash ref of format pairs
for my $sample (values %pairs) {
# Now iterate through each pair in the sample
for my $fastq (values %$sample) {
say "$fastq->{sample_id} has format $fastq->{format}";
}
}
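For the pairing itself, the same hash-of-hashes layout can be used to report which samples have both reads. A minimal sketch under assumptions: the paths below are hypothetical names in the format the question describes, the sample ID is assumed to contain no underscore, and File::Basename is used so that directory parts cannot leak into the sample ID.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical file names in the format described in the question.
my @filenames = qw(
    /data/run1/S1_ACGT_L001_R1_001.fastq
    /data/run1/S1_ACGT_L001_R2_001.fastq
    /data/run1/S2_TTGA_L001_R1_001.fastq
);

my %pairs;    # sample ID => { R1 => path, R2 => path }
for my $path (@filenames) {
    # Work on the bare name, minus its extension.
    ( my $stem = basename($path) ) =~ s/\.fastq\z//;
    my ( $sample_id, undef, undef, $read ) = split /_/, $stem;
    $pairs{$sample_id}{$read} = $path;
}

my ( @complete, @incomplete );
for my $sample_id ( sort keys %pairs ) {
    my $reads = $pairs{$sample_id};
    if ( $reads->{R1} && $reads->{R2} ) { push @complete,   $sample_id }
    else                                { push @incomplete, $sample_id }
}
print "paired: @complete\n";
print "missing a mate: @incomplete\n";
```

For each complete sample, $pairs{$id}{R1} and $pairs{$id}{R2} then give the forward and reverse file to hand to whatever tool consumes the pair.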
