Creating a list of duplicate filenames with Perl - arrays

I've been trying to write a script to pre-process some long lists of files, but I am not confident (nor competent) with Perl yet and am not getting the results I want.
The script below is very much work in progress but I'm stuck on the check for duplicates and would be grateful if anyone could let me know where I am going wrong. The block dealing with duplicates seems to be of the same form as examples I have found but it doesn't seem to work.
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', $ARGV[0] or die "can't open: $!";
foreach my $line (<$fh>) {
# Trim list to remove directories which do not need to be checked
next if $line =~ m/Inventory/;
# MORE TO DO
next if $line =~ m/Scanned photos/;
$line =~ s/\n//; # just for a tidy list when testing
my @split = split(/\/([^\/]+)$/, $line); # separate filename from rest of path
foreach (@split) {
push (my @filenames, "$_");
# print "@filenames\n"; # check content of array
my %dupes;
foreach my $item (@filenames) {
next unless $dupes{$item}++;
print "$item\n";
}
}
}
I am struggling to understand what is wrong with my check for duplicates. I know the array contains duplicates (uncommenting the first print function gives me a list with lots of duplicates). The code as it stands generates nothing.
Not the main purpose of my post, but my final aim is to remove unique filenames from the list and keep filenames which are duplicated in other directories.
I know that none of these files are identical but many are different versions of the same file which is why I'm focussing on filename.
Eg I would want an input of:
~/Pictures/2010/12345678.jpg
~/Pictures/2010/12341234.jpg
~/Desktop/temp/12345678.jpg
to give an output of:
~/Pictures/2010/12345678.jpg
~/Desktop/temp/12345678.jpg
So I suppose ideally it would be good to check for uniqueness of a match based on the regex without splitting if that is possible.

The loop below prints nothing, because the array and the hash are both re-created on every iteration and so only ever contain one value:
foreach (@split) {
push (my @filenames, "$_"); # add one element to lexical array
my %dupes;
foreach my $item (@filenames) { # loop one time
next unless $dupes{$item}++; # add one key to lexical hash
print "$item\n";
}
} # @filenames and %dupes go out of scope
A lexical variable (declared with my) has a scope that extends to the surrounding block { ... }, in this case your foreach loop. When they go out of scope, they are reset and all the data is lost.
I don't know why you copy the file names from @split to @filenames; it seems redundant. The way to dedupe this would be:
my %seen;
my @uniq;
@uniq = grep !$seen{$_}++, @split;
Additional information:
You might also be interested in using File::Basename to get the file name:
use File::Basename;
my $fullpath = "~/Pictures/2010/12345678.jpg";
my $name = basename($fullpath); # 12345678.jpg
Your substitution
$line =~ s/\n//;
Should probably be
chomp($line);
When you read from a file handle, using for (foreach) means you read all the lines and store them in memory. It is preferable most times to instead use while, like this:
while (my $line = <$fh>)
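Putting these points together, a minimal sketch might look like the following. It assumes the paths are already in memory (the three example paths from the question stand in for the file), and the sub name repeated_names is made up for illustration. The important detail is that %seen is declared once, outside the loop, so its counts survive from one path to the next. When reading from a file handle, the same loop body would sit inside while (my $line = <$fh>).

```perl
use strict;
use warnings;
use File::Basename;

# Hypothetical helper: returns each file name that has already been seen
# earlier in the list, in the order the repeats are encountered.
sub repeated_names {
    my @paths = @_;
    my %seen;                      # declared outside the loop, so it persists
    my @repeats;
    for my $path (@paths) {
        chomp $path;               # harmless if there is no trailing newline
        my $name = basename($path);
        push @repeats, $name if $seen{$name}++;
    }
    return @repeats;
}

# Example paths taken from the question
my @paths = (
    "~/Pictures/2010/12345678.jpg",
    "~/Pictures/2010/12341234.jpg",
    "~/Desktop/temp/12345678.jpg",
);
print "$_\n" for repeated_names(@paths);
```

With the question's three paths, only 12345678.jpg is reported, because it is the only name that turns up a second time.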

TLP's answer gives lots of good advice. In addition:
Why use both an array and a hash to store the filenames? Simply use the hash as your one storage solution, and you will automatically remove duplicates. i.e:
my %filenames; #outside of the loops
...
foreach (@split) {
$filenames{$_}++;
}
Now when you want to get the list of unique filenames, just use keys %filenames or, if you want them in alphabetical order, sort keys %filenames. And the value for each hash key is a count of occurrences, so you can find out which ones were duplicated if you care.
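The same counting idea can be pointed at the asker's final aim of keeping only the paths whose file name appears in more than one directory. This is a sketch under assumptions rather than a drop-in solution: the sub name paths_with_shared_names is invented here, and the input is the three example paths from the question. It makes two passes, one to count names and one to filter.

```perl
use strict;
use warnings;
use File::Basename;

# Hypothetical helper: keeps only the paths whose file name occurs more
# than once, preserving the original input order.
sub paths_with_shared_names {
    my @paths = @_;
    my %count;
    $count{ basename($_) }++ for @paths;                  # first pass: count names
    return grep { $count{ basename($_) } > 1 } @paths;    # second pass: filter
}

# Example paths taken from the question
my @paths = (
    "~/Pictures/2010/12345678.jpg",
    "~/Pictures/2010/12341234.jpg",
    "~/Desktop/temp/12345678.jpg",
);
print "$_\n" for paths_with_shared_names(@paths);
```

For the question's input this leaves the two 12345678.jpg paths and drops the unique 12341234.jpg, without needing to split the path by hand.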

Related

Can't use string as an ARRAY ref while strict refs in use

Getting an error when I attempt to dump out part of a multi dimensional hash array. Perl spits out
Can't use string ("somedata") as an ARRAY ref while "strict refs" in use at ./myscript.pl
I have tried multiple ways to access part of the array I want to see but I always get an error. I've used Dumper to see the entire array and it looks fine.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);
use String::Util qw(trim);
my %arrHosts;
open(my $filehdl, "<textfile.txt") || die "Cannot open or find file textfile.txt: $!\n";
while( my $strInputline = <$filehdl> ) {
chomp($strInputline);
my ($strHostname,$strOS,$strVer,$strEnv) = split(/:/, $strInputline);
$strOS = lc($strOS);
$strVer = trim($strVer);
$strEnv = trim($strEnv);
$strOS = trim($strOS);
$arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
}
# If you want to see the entire database, remove the # in front of Dumper
print Dumper \%arrHosts;
foreach my $machine (@{$arrHosts{solaris}{10}{DEV}}) {
print "$machine\n";
}
close($filehdl);
The data is in the form
machine:OS:OS version:Environment
For example
bigserver:solaris:11:PROD
smallerserver:solaris:11:DEV
I want to print out only the servers that are solaris, version 11, in DEV. Using hashes seems the easiest way to store the data but alas, Perl barfs when attempting to print out only a portion of it. Dumper works great but I don't want to see everything. Where did I go wrong??
You have the following:
$arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
That means the following contains a string:
$arrHosts{solaris}{10}{DEV}
You are treating it as if it contains a reference to an array. To group the hosts by OS+ver+env, replace
$arrHosts{$strOS}{$strVer}{$strEnv} = $strHostname;
with
push @{ $arrHosts{$strOS}{$strVer}{$strEnv} }, $strHostname;
Iterating over @{ $arrHosts{solaris}{10}{DEV} } will then make sense.
My previous code also had the obvious problem whereby if the combo of OS, Version, and Environment were the same it wrote over previous data. Blunderful. Push is the trick
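A self-contained sketch of the push approach is shown below. It is only an illustration: the sub name group_hosts and the third host are made up, the input lines are passed in directly instead of being read from textfile.txt, and the trim calls from the question are left out so that no non-core module is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: groups host names by OS, version and environment.
# Each input line has the form machine:OS:OS version:Environment.
sub group_hosts {
    my %hosts;
    for my $line (@_) {
        my ($host, $os, $ver, $env) = split /:/, $line;
        # push autovivifies the nested hashes and the array reference
        push @{ $hosts{ lc $os }{$ver}{$env} }, $host;
    }
    return \%hosts;
}

my $hosts = group_hosts(
    "bigserver:solaris:11:PROD",
    "smallerserver:solaris:11:DEV",
    "tinyserver:solaris:11:DEV",   # made-up third host
);

# "|| []" gives an empty list to loop over when no host matches
for my $machine (@{ $hosts->{solaris}{11}{DEV} || [] }) {
    print "$machine\n";
}
```

Because every leaf is now an array reference, two hosts with the same OS, version and environment sit side by side instead of the second overwriting the first.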

Why can't I lookup an array index inside a foreach loop in Powershell?

first question on here so forgive me if I make any mistakes, I will try to stick to the guidelines.
I am trying to write a PowerShell script that populates two arrays from data I read in via CSV file. I'm using the arrays to cross-reference directory names in order to rename each directory. One array contains the current name of the directory and the other array contains the new name.
This all seems to be working so far. I successfully create and populate the arrays, and using a short input and index lookup to check my work I can search one array for a current name and successfully retrieve the correct new name from the second array. However when I try to implement this code in a foreach loop that runs through every directory name, I can't lookup the array index (it keeps coming back as -1).
I used the code in the first answer to "Read a Csv file with powershell and capture corresponding data" as my template. Here's my modification to the input lookup, which works just fine:
$input = Read-Host -Prompt "Merchant"
if($Merchant -contains $input)
{
Write-Host "It's there!"
$find = [array]::IndexOf($Merchant, $input)
Write-Host Index is $find
}
Here is my foreach loop that attempts to use the Index lookup, but returns -1 every time. However I know it's finding the file because it enters the if statement and prints "It's there!"
foreach($file in Get-ChildItem $targetDirectory)
{
if($Merchant -contains $file)
{
Write-Host "It's there!"
$find = [array]::IndexOf($Merchant, $file)
Write-Host Index is $find
}
}
I can't figure it out. I'm a PowerShell newb so maybe it's a simple syntax problem, but it seems like it should work and I can't find where I'm going wrong.
Your problem seems to be that $Merchant is a collection of file names (of type string), whereas $file is a FileInfo object.
The -contains operator expects $file to be a string, since $Merchant is a string array, and works as you expect (since FileInfo.ToString() just returns the file name).
IndexOf() isn't so forgiving. It recognizes that none of the items in $Merchant are of the type FileInfo, so it never finds $file.
You can either refer directly to the file name:
[array]::IndexOf($Merchant,$file.Name)
or, as @PetSerAl showed, convert $file to a string instead:
[array]::IndexOf($Merchant,[string]$file)
# or
[array]::IndexOf($Merchant,"$file")
# or
[array]::IndexOf($Merchant,$file.ToString())
Finally, you can call IndexOf() directly on the array, no need to use the static method:
$Merchant.IndexOf($file.Name)

(Perl) Trying to write a foreach statement with a simple array. Confused with the formatting

I'm a beginner in programming, this is my first language. And in my class we are using a slightly out of date book to learn with (Book copyrighted '02). Doubt this would affect you helping me much, but worth noting.
The problem
I don't know how to format a simple foreach statement using/combined with an array. I'm getting mixed up and my book doesn't provide examples. I'm trying to get it so the Uses/Primary_Uses are shown when the user checks multiple checkboxes.
#!/usr/bin/perl
#c04ex5.cgi - creates a dynamic Web page that acknowledges
#the receipt of a registration form
print "Content-type: text/html\n\n";
use CGI qw(:standard -debug);
use strict;
#declare variables
my ($name, $serial, $modnum, $sysletter, $primary_uses, $use, @primary_uses, @uses);
my @models = ("Laser JX", "Laser PL", "ColorPrint XL");
my @systems = ("Windows", "Macintosh", "UNIX");
my @primary_uses = ("Home", "Business", "Educational", "Other");
#assign input items to variables
$name = param('Name');
$serial = param('Serial');
$modnum = param('Model');
$sysletter = param('System');
@primary_uses = param('Use');
#create Web page
print "<HTML><HEAD><TITLE>Juniper Printers</TITLE></HEAD>\n";
print "<BODY><H2>\n";
print "Thank you , $name, for completing \n";
print "the registration form.<BR><BR>\n";
print "We have registered your Juniper $models[$modnum] printer, \n";
print "serial number $serial.\n";
print "You indicated that the printer will be used on the\n";
print "$systems[$sysletter] system. <BR>\n";
print "The primary uses for this printer will be the following:\n";
#The part I'm having trouble with.
foreach $use (@primary_uses) {
print "$use [@use]<BR>\n";
}
print "</H2></BODY></HTML>\n";
My naming of variables might be a bit off, I was getting desperate and making sure I declare more than I should.
If you wanted to print a simple list of items, you should just use the $use variable:
foreach $use (@primary_uses) {
print "$use<BR>\n";
}
Note that this will also remove the fatal error that comes from not declaring @use. Perhaps that was also a point of confusion for you. $use and @use are two completely different variables, despite having the same name.
Note that you can print a list with the CGI module very easily:
my $cgi = CGI->new;
print $cgi->li(\@primary_uses);
This wraps each item of the array in an HTML list item element, like so:
<li>Home</li> <li>Business</li> <li>Educational</li> <li>Other</li>
Some other pointers:
Note that it is a good idea to declare your variables in the smallest scope possible
foreach my $use (@primary_uses) { # note the use of "my"
print "$use<BR>\n";
}
That also goes with the other variables. A good idea is to declare them right as you initialize them:
my $name = param('Name');
Then people who read your code don't have to scan backwards in the file to see where the variable has "been" before.
Note that you should never, ever use data from a web form without sanitizing it first, because it is a huge security risk, especially when you print it. Printed unescaped, it lets a web user inject arbitrary HTML or script into your page.
You should know that for and foreach are synonyms for the same keyword.
Also, you should always, always use warnings:
use warnings;
There really is no good reason to ever not turn warnings on.
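As a rough illustration of the sanitizing advice, the sketch below escapes the characters that are special in HTML before printing each value. The escape_html sub is a hypothetical helper written out here only so that the example is self-contained; in real code a maintained routine, such as the escapeHTML function in CGI or the HTML::Entities module, would be the safer choice. The input values are made up and stand in for param('Use').

```perl
use strict;
use warnings;

# Hypothetical helper: replaces the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Made-up values standing in for param('Use')
my @primary_uses = ("Home", "<script>alert(1)</script>");

# The loop variable is declared with "my" in the smallest possible scope
foreach my $use (@primary_uses) {
    print escape_html($use), "<BR>\n";
}
```

The second value reaches the browser as harmless text instead of a script tag.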
foreach my $myuse (@primary_uses) {
print $myuse;
}
You need to declare the variable.

perl, removing elements from array in for loop

will the following code always work in perl ?
for loop iterating over @array {
# do something
if ($condition) {
remove current element from @array
}
}
I know that in Java this results in exceptions. The above code is working for me for now, but I want to be sure that it will work in all cases in Perl. Thanks
Well, it's said in the doc:
If any part of LIST is an array, foreach will get very confused if you
add or remove elements within the loop body, for example with splice.
So don't do that.
It's a bit better with each:
If you add or delete a hash's elements while iterating over it,
entries may be skipped or duplicated--so don't do that. Exception: In
the current implementation, it is always safe to delete the item most
recently returned by each(), so the following code works properly:
while (($key, $value) = each %hash) {
print $key, "\n";
delete $hash{$key}; # This is safe
}
But I suppose the best option here would be just using grep:
@some_array = grep {
# do something with $_
some_condition($_);
} @some_array;
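To make the grep approach concrete, here is a small sketch with made-up data. Instead of deleting elements while looping, it builds a new list of the elements to keep and assigns it back to the same array; the condition (keeping odd numbers) is chosen only for illustration.

```perl
use strict;
use warnings;

my @numbers = (1 .. 10);

# Keep only the elements for which the block returns true; the even
# numbers are "removed" by not being copied into the result.
@numbers = grep { $_ % 2 } @numbers;

print "@numbers\n";   # prints "1 3 5 7 9"
```

The array is never modified while it is being iterated, so the warning quoted from the documentation does not apply.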

Perl: Hash within Array within Hash

I am trying to build a hash that has an array as one value; this array will then contain hashes. Unfortunately, I have coded it wrong and it is being interpreted as a pseudo-hash. Please help!
my $xcHash = {};
my $xcLine;
#populate hash header
$xcHash->{XC_HASH_LINES} = ();
#for each line of data
$xcLine = {};
#populate line hash
push(@{$xcHash->{XC_HASH_LINES}}, $xcLine);
foreach $xcLine ($xcHash->{XC_HASH_LINES})
#pseudo-hash error occurs when I try to use $xcLine->{...}
$xcHash->{XC_HASH_LINES} is an arrayref and not an array. So
$xcHash->{XC_HASH_LINES} = ();
should be:
$xcHash->{XC_HASH_LINES} = [];
foreach takes a list. It can be a list containing a single scalar (foreach ($foo)), but that's not what you want here.
foreach $xcLine ($xcHash->{XC_HASH_LINES})
should be:
foreach my $xcLine (@{$xcHash->{XC_HASH_LINES}})
foreach $xcLine ($xcHash->{XC_HASH_LINES})
should be
foreach $xcLine ( @{ $xcHash->{XC_HASH_LINES} } )
See http://perlmonks.org/?node=References+quick+reference for easy to remember rules for how to dereference complex data structures.
Golden Rule #1
use strict;
use warnings;
It might seem like a fight at the beginning, but they will instill good Perl practices and help identify many syntactical errors that might otherwise go unnoticed.
Also, Perl has a neat feature called autovivification. It means that $xcHash and $xcLine need not be pre-defined or constructed as references to arrays or hashes.
The issue faced here is to do with the not uncommon notion that a scalar can hold an array or hash; it doesn't. What it holds is a reference. This means that $xcHash->{XC_HASH_LINES} is an arrayref, not an array, which is why it needs to be dereferenced as an array using the @{...} notation.
Here's what I would do:
my %xcHash;
# for each line of data:
push @{$xcHash{XC_HASH_LINES}}, $xcLine;
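A complete version of that outline might look like the sketch below. The keys TEXT and LENGTH and the two sample lines are invented for illustration, since the question does not show what each line hash holds. Autovivification creates the array reference on the first push, so nothing needs to be initialised beforehand.

```perl
use strict;
use warnings;

my %xcHash;

# Made-up line data; each line becomes its own anonymous hash
for my $text ("first", "second") {
    my $xcLine = { TEXT => $text, LENGTH => length $text };
    push @{ $xcHash{XC_HASH_LINES} }, $xcLine;   # array ref is autovivified
}

# Dereference the array ref to loop over the line hashes
for my $xcLine (@{ $xcHash{XC_HASH_LINES} }) {
    print "$xcLine->{TEXT} has $xcLine->{LENGTH} characters\n";
}
```

Inside the second loop, $xcLine is a plain hash reference, so $xcLine->{...} works without any pseudo-hash error.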

Resources