Perl regex capture grouping

Perl regex capture grouping - arrays

Is there a better way to write the below code using a regex capture grouping?
I'm looking to get the folder name immediately after Recordings.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $path1 = '\\server1\Recordings\AU-AR-Online\Outbound\20160120\52031\52031-14-07-16.wav';
my $path2 = '\\server2\e$\Recordings\SU-AC-Online\Outbound\20160120\52031\52031-14-07-16.wav';
my #paths = ( $path1,$path2 );
foreach my $path (#paths) {
# Split path into fields
my #array = (split /\\/, $path);
# Get index of Recordings
my( $index )= grep { $array[$_] eq "Recordings" } 0..$#array;
# Brand always follows Recordings
print $array[$index+1];
}

Sure, just replace the content of your loop with this:
my ($brand) = $path =~ m|\\Recordings\\([^\\]+)| or die "Not found";
print $brand;

Capture the folder directly after Recordings\
my ($brand) = $path =~ m{ Recordings \\ ( [^\\] + ) }x )
Using the x modifier on the regex means whitespace is ignore, which can help make the regex more readable.
If the brand folder is always 5th-last, you could split the path and grab it by negative indexing.
my $brand = (split /\\/, $path)[-5];
But again, that only works if the brand is always 5th last. I don't know what your data set is.
Also, if your working with paths, there are many modules (such as Path::Tiny) that make it easier to get parent/child/absolute paths, basenames, etc.

Related

Regexp to Compare partial filenames then moving to another directory perl

I am working on a script to compare non-running files within a dir to running files from a command. I have to use Regex to strip the front half of the filenames from the dir then regex to strip the filenames out of a command which then records the unmatched names into an array.
The part I cannot figure out is how I can move the filenames from the old dir into a new directory for future deletion.
In order to move the files I will need to enclose them in wildcards, * due to the random numbers in front of the filenames and the extention.
example filenames before and after:
within dir:
13209811124300209156562070_cake_872_trucks.rts
within command:
{"file 872","cake_872_trucks.rts",running}
in #events array:
cake_872_trucks
My code:
#!/usr/bin/perl -w
use strict;
use warnings;
use File::Copy qw(move);
use Data::Dumper;
use List::Util 'max';
my $orig_dir = "/var/user/data/";
my $dest_dir = "/var/user/data/DeleteMe/";
my $dir = "/var/user/data";
opendir(DIR, $dir) or die "Could not open $dir: $!\n";
my #allfiles = readdir DIR;
close DIR;
my %files;
foreach my $allfiles(#allfiles) {
$allfiles =~ m/^(13{2}638752056463{2}635181_|1[0-9]{22}_|1[0-9]{23}_|1[0-9]{24}_|1[0-9]{25}_)([0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z][0-9a-z]{3}_[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z]{2}0[0-9]\.rts|[a-z]{2}[0-9a-z][0-9]\N[0-9a-z]\N[0-9]\N[0-9]\N[0-9a-z]{4}\N[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts|S0{2}2_86F_JATD_01ZF\.rts)$/im;
$files{$2} = [$1];
}
my #stripfiles = keys %files;
my $cmd = "*****";
my #runEvents = `$cmd`;
chomp #runEvents;
foreach my $running(#runEvents) {
$running =~ s/^\{"blah 8[0-9a-z]{2}","(?<field2>CBE1_D{3}1_8EC_J6TG0{2}\.rts|[0-9a-z]{4}_8[0-9a-z]{2}_[0-9a-z]{2}[a-z][0-9a-z]0[0-9]\.rts|[a-z]{2}[0-9a-z]{2}\N{2}[0-9a-z]{2}\N{2}[0-9][0-9a-z]{2}\N[0-9]{2}\.rts)(?:",\{239,20,93,5\},310{2},20{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{2}0{3},run{2}ing\}|",\{239,20,93,5\},310{2},[0-9]{3}0{4},run{2}ing\}|",\{239,20,93,5\},3[0-9]0{2},[0-9]{2}0{4},run{2}ing\})$/$+{field2}/img;
}
my #events = grep {my $x = $_; not grep {$x =~/\Q$_/i}#runEvents}#stripfiles;
foreach my $name (#events) {
my ($randnum, $fnames) = { $files{$name}};
my $combined = $randnum . $fnames;
print "Move $file from $orig_dir to $dest_dir";
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
#print scalar(grep $_, #stripfiles), "\n";
#returned 1626
#print scalar(grep $_, #runEvents), "\n";
#returned 102
#print scalar(grep $_, #allfiles), "\n";
#returned 1906

Once you are parsing filenames with regex there is no reason not to be able to capture all parts so that you can later reconstitute needed parts of the filename.
I assume that that overly long (and incomplete) regex does what it is meant to.
I am not sure how the files to move relate to the original files in #allfiles, since those are fetched from /var/user/data while your moving attempt uses /home/user/RunBackup. So code snippets below are more generic.
If what gets moved are precisely the files from #allfiles then just keep the file name
my %files;
foreach my $oldfile (#allfiles) {
$oldfile =~ m/...(...).../; # your regex, but capture the name
$files{$1} = $oldfile;
}
where by /...(...).../ I mean to indicate that you use your regex, but to which you add parenthesis around the part of the pattern that matches the name itself.
Then you can later retrieve the filename from the "name" of interest (cake_872_trucks).
If, however, the filename components may be needed to patch a different (while related) filename then capture and store the individual components
my %files;
foreach my $oldfile (#allfiles) {
$oldfile =~ m/(...)(...)(...)/; # your regex, just with capture groups
$files{$2} = [$1, $3]; # add to %files: name => [number, ext]
}
The regex only matches (why change names in #allfiles with s///?), and captures.
The first set of parenthesis captures that long leading factor (number) into $1, the second one gets the name (cake_872_trucks) into $2, and the third one has the extension, in $3.
So you end up with a hash with keys that are names of interest, with their values being arrayrefs with all other needed components of the filename. Please adjust as needed as I don't know what that regex does and may have missed some parts.
Now once you go through #events you can rebuild the name
use File::Copy qw(move);
foreach my $name (#events) {
my ($num, $ext) = #{ $files{$name} };
my $file = $num . $name . $ext;
say "Move $file from $orig_dir to $dest_dir";
move("$orig_dir/$file", $dest_dir) or warn "Can't move $file: $!";
}
But if the files to move are indeed from #allfiles (as would be the case in this example) then use the first version above to store filenames as values in %files and now retrieve them
foreach my $name (#events) {
move ("$orig_dir/$files{$name}", $dest_dir)
or warn "Can't move $file: $!";
}
I use the core module File::Copy, instead of going out to the system for the move command.
You can also rebuild the name by going through the directory again, now with names of interest on hand. But that'd be very expensive since you have to try to match every name in #events for every file read in the directory (O(mn) complexity).
What you asked about can be accomplished with glob (and note File::Glob's version)
my #files = glob "$dir/*${name}*";
but you'd have to do this for every $name -- a huge and needless waste of resources.
If that regex really must spell out specific numbers, here is a way to organize it for easier digestion (and debugging!): break it into reasonable parts, with a separate variable for each.
Ideally each part of alternation would be one variable
my $p1 = qr/.../;
my $p2 = qr/.../;
...
my $re_alt = join '|', $p1, $p2, ...;
my $re_other = qr/.../;
$var =~ m/^($re_alt)($re_other)(.*)$/; # adjust anchors, captures, etc
where the qr operator builds a regex pattern.
Adjust those capturing parenthesis, anchors, etc to your actual needs. Breaking it up so that the regex is sensibly split into variables will go a long way for readability, and thus correctness.
Assuming that there is a good reason to seek those specific numbers in filenames, this is also a good way to document any such fixed factors.

I guess you need something like this:
my $path = '/home/user/RunBackup/';
my #files = map {$path."*$_*"} #events;
system(join " ", "mv", #files, "/home/user/RunBackup/files/");
If there are lots of files you might need to move them one by one:
system(join " ", "mv", $_, "/home/user/RunBackup/files/") for #files;

How to search for overlapping matches for a regex pattern within a string

I have this string
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*"
and I want to find every substring starting with M and ending with * and add it to an array. This means that the above string would give me 6 elements in my array.
I have this code
foreach ( $line =~ m/M.*?\*/g ) {
push #ORF, $_;
}
but it only gives me two elements in my array since it ignores overlapping strings.
Is there any way to get all matches? I tried googling but could not find an answer.

Can use code within re and Backtracking control verbs for a little magic:
#!/usr/bin/env perl
use strict;
use warnings;
my $line = "MZEFSRGGRMEAZFE*MQZEFFMAEZF*";
local our #match;
$line =~ m/(M.*\*)(?{ push #match, $1 })(*FAIL)/;
use Data::Dump;
dd #match;
Outputs:
(
"MZEFSRGGRMEAZFE*MQZEFFMAEZF*",
"MZEFSRGGRMEAZFE*",
"MEAZFE*MQZEFFMAEZF*",
"MEAZFE*",
"MQZEFFMAEZF*",
"MAEZF*",
)

I don't believe it's possible to create a single regex pattern that will match all such substrings, because you're asking for both a greedy and a non-greedy match at the same time, and everything else in-between
I suggest you store all possible start and end positions of these substrings and use a double loop to combine all start positions with all end positions
This program demonstrates
use strict;
use warnings 'all';
use feature 'say';
my $line = 'MZEFSRGGRMEAZFE*MQZEFFMAEZF*';
my #orf;
{
my (#s, #e);
push #s, $-[0] while $line =~/M/g;
push #e, $+[0] while $line =~/\*/g;
for my $s ( #s ) {
for my $e ( #e ) {
push #orf, substr $line, $s, $e-$s if $e > $s;
}
}
}
say for #orf;
output
MZEFSRGGRMEAZFE*
MZEFSRGGRMEAZFE*MQZEFFMAEZF*
MEAZFE*
MEAZFE*MQZEFFMAEZF*
MQZEFFMAEZF*
MAEZF*

Using perl arrays to compare file names to find matching mp3 and m4a file types

I've populated an array with all files with the file extension of mp3 and a separate array with all m4a files as listed below
my #mp3filesarray = grep ( -f ,<*.mp3>);
my #m4afilesarray = grep ( -f ,<*.m4a>);
What I would like to do is compare the file names of the arrays to see if there is a match or partial match and, if there is, copy both the mp3 file and the m4a file to a new sub directory so that I may review the files to determine which file I want to keep. I am sure that I need to use a regex for this but am not sure how to this. I would appreciate any help with this. Thanks.

Here's how I'd approach this.
use strict;
use warnings;
use File::Path qw(make_path);
use File::Copy qw(move);
my %seen;
while ( my $file = glob '*.{mp3,m4a}' )
{
++$seen{ substr($file, 0, length() - 4) };
}
for my $dupe ( grep { $seen{$_} > 1 } keys %seen )
{
make_path($dupe);
move("$dupe.$_", "$dupe/$dupe.$_" for (qw(mp3 m4a)); # Change / to \ if you're on Windows
}
I start by globbing all files ending in m4a or mp3, then strip it down to the basename without an extension and hash it. Then I loop over any duplicates and move them into their own folder.
A regular expression would be overkill, since the glob expansion syntax is much more restrictive than a regular expression.
This approach only finds duplicates where the only difference is the file extension, however. To do fuzzy matching you'd need a different technique than the O(n) hashing strategy I used.
The first while loop could also be written using File::Basename::fileparse(), like so:
while (my $file = glob '*.{mp3,m4a}')
{
my $name = fileparse($file, qr/ [.] [^.]* \z/x);
++$seen{$name};
}

What you're trying to do is quite expensive - in order to look for partial matches, you need to compare each filename with each other filename. Presumably you don't want to compare the mp3 list with each other, which makes it a little easier.
I'd go with a foreach loop:
my %files;
foreach my $file ( glob ( '*.mp3 *.m4a' ) {
my ( $name, $type ) = ( $file =~ m/(\w+)\.(m[4p][a3])/ );
$files{$type}{$name}++;
}
foreach my $mp3_file ( keys %{ $files{'mp3'} } ) {
if ( $files{'m4a'} ) { print "Dupe detected: mp3_file\n"; next; }
foreach my $m4a_file ( keys %{ $files{'m4a'} } ) {
if ( $mp3_file =~ m/\Q$m4a_file/ ) { print "Partial match $mp3_file $m4a_file\n"; }
if ( $m4a_file =~ m/\Q$mp3_file/ ) { print "Partial match $m4a_file $mp3_file\n"; }
}
}
Something like that - you're doing a straight substring comparison on the file name - without extension. You'll want to do the same with the m4a to mp3 compare. (You could simplify a bit if you're prepared to compare each file with each file independent of extension, but you also increase the number of comparisons... and of course, you're guaranteed at least one duplicate :))

if you dont have any numbers in file (like 1.mp3 or a12b.m4a) then this would work:
use strict;
use warnings;
system("sudo mkdir review");
my (#spmp3,#spm4a);
my #mp3file=`ls | grep mp3`;
my #m4afile=`ls | grep m4a`;
for (my $i=0; $i<=$#mp3file; $i++)
{
#spmp3 = split (/\./, $mp3file[$i]);
}
for (my $j=0; $j<=$#m4afile; $j++)
{
push (#spm4a, split (/\./, $m4afile[$j]));
}
for (my $k=0; $k<=$#mp3file; $k=$k+2)
{
for (my $l=0; $l<=$#m4afile; $l=$l+2)
{
if ( $spmp3[$k] eq $spm4a[$l] )
{
system(" mv $spmp3[$k].mp3 $spm4a[$k].m4a ./review");
}
}
}

Checking for Duplicates in array

What's going on:
I've ssh'd onto my localhost, ls the desktop and taken those items and put them into an array.
I hardcoded a short list of items and I am comparing them with a hash to see if anything is missing from the host (See if something from a is NOT in b, and let me know).
So after figuring that out, when I print out the "missing files" I get a bunch of duplicates (see below), not sure if that has to do with how the files are being checked in the loop, but I figured the best thing to do would be to just sort out the data and eliminate dupes.
When I do that, and print out the fixed data, only one file is printing, two are missing.
Any idea why?
#!/usr/bin/perl
my $hostname = $ARGV[0];
my #hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my #output =`ssh $hostname "cd Desktop; ls -a"`;
my %comparison;
for my $file (#hostFiles) {
$comparison{$file} +=1;
}
for my $file (#output) {
$comparison{$file} +=2
}
for my $file (sort keys %comparison) {
#missing = "$file\n" if $comparison{$file} ==1;
#print "Extra file: $file\n" if $comparison{$file} ==2;
print #missing;
}
my #checkedMissingFiles;
foreach my $var ( #missing ){
if ( ! grep( /$var/, #checkedMissingFiles) ){
push( #checkedMissingFiles, $var );
}
}
print "\n\nThe missing Files without dups:\n #checkedMissingFiles\n";
Password:
awesomeness.txt ##This is what is printing after comparing the two arrays
awesomeness.txt
filecheck.pl
filecheck.pl
filecheck.pl
hostscript.pl
hostscript.pl
The missing Files without dups: ##what prints after weeding out duplicates
hostscript.pl

The perl way of doing this would be:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my %hostFiles = qw( filecheck.pl 1 hostscript.pl 1 awesomeness.txt 1);
# ssh + backticks + ls, not the greatest way to do this, but that's another Q
my #files =`ssh $ARGV[0] "ls -a ~/Desktop"`;
# get rid of the newlines
chomp #files;
#grep returns the matching element of #files
my %existing = map { $_ => 1} grep {exists($hostFiles{$_})} #files;
print Dumper([grep { !exists($existing{$_})} keys %hostFiles]);
Data::Dumper is a utility module, I use it for debugging or demonstrative purposes.
If you want print the list you can do something like this:
{
use English;
local $OFS = "\n";
local $ORS = "\n";
print grep { !exists($existing{$_})} keys %hostFiles;
}
$ORS is the output record separator (it's printed after any print) and $OFS is the output field separator which is printed between the print arguments. See perlvar. You can get away with not using "English", but the variable names will look uglier. The block and the local are so you don't have to save and restore the values of the special variables.
If you want to write to a file the result something like this would do:
{
use English;
local $OFS = "\n";
local $ORS = "\n";
open F, ">host_$ARGV[0].log";
print F grep { !exists($existing{$_})} keys %hostFiles;
close F;
}
Of course, you can also do it the "classical" way, loop trough the array and print each element:
open F, ">host_$ARGV[0].log";
for my $missing_file (grep { !exists($existing{$_})} keys %hostFiles) {
use English;
local $ORS = "\n";
print F "File is missing: $missing_file"
}
close F;
This allows you to do more things with the file name, for example, you can SCP it over to the host.

It seems to me that looping over the 'required' list makes more sense - looping over the list of existing files isn't necessary unless you're looking for files that exist but aren't needed.
#!/usr/bin/perl
use strict;
use warnings;
my #hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my #output =`ssh $ARGV[0] "cd Desktop; ls -a"`;
chomp #output;
my #missingFiles;
foreach (#hostFiles) {
push( #missingFiles, $_ ) unless $_ ~~ #output;
}
print join("\n", "Missing files: ", #missingFiles);

#missing = "$file\n" assigns the array #missing to contain a single element, "$file\n". It does this every loop, leaving it with the last missing file.
What you want is push(#missing, "$file\n").

How can I extract just the elements I want from a Perl array?

Hey I'm wondering how I can get this code to work. Basically I want to keep the lines of $filename as long as they contain the $user in the path:
open STDERR, ">/dev/null";
$filename=`find -H /home | grep $file`;
#filenames = split(/\n/, $filename);
for $i (#filenames) {
if ($i =~ m/$user/) {
#keep results
} else {
delete $i; # does not work.
}
}
$filename = join ("\n", #filenames);
close STDERR;
I know you can delete like delete $array[index] but I don't have an index with this kind of loop that I know of.

You could replace your loop with:
#filenames = grep /$user/, #filenames;

There's no way to do it when you're using foreach loop. But nevermind. The right thing to do is to use File::Find to accomplish your task.
use File::Find 'find';
...
my #files;
my $wanted = sub {
return unless /\Q$file/ && /\Q$user/;
push #files, $_;
};
find({ wanted => $wanted, no_chdir => 1 }, '/home');
Don't forget to escape your variables with \Q for use in regular expressions.
BTW, redirecting your STDERR to /dev/null is better written as
{
local *STDERR;
open STDERR, '>', '/dev/null';
...
}
It restores the filehandle after exiting the block.

If you have a find that supports -path, then make it do the work for you, e.g.,
#! /usr/bin/perl
use warnings;
use strict;
my $user = "gbacon";
my $file = "bash";
my $path = join " -a " => map "-path '*$_*'", $user, $file;
chomp(my #filenames = `find -H /home $path 2>/dev/null`);
print map "$_\n", #filenames;
Note how backticks in list context give back a list of lines (including their terminators, removed above with chomp) from the command's output. This saves you having to split them yourself.
Output:
/home/gbacon/.bash_history
/home/gbacon/.bashrc
/home/gbacon/.bash_logout

If you want to remove an item from an array, use the multi-talented splice function.
my #foo = qw( a b c d e f );
splice( #foo, 3, 1 ); # Remove element at index 3.
You can do all sorts of other manipulations with splice. See the perldoc for more info.
As codeholic alludes to, you should never modify an array while iterating over it with a for loop. If you want to modify an array while iterating, use a while loop instead.
The reason for this is that for evaluates the expression in parens once, and maps each item in the result list to an alias. If the array changes, the pointers get screwed up and chaos will follow.
A while evaluates the condition each time through the loop, so you won't run into issues with pointers to non-existent values.