Comparing two arrays in Perl

I know this has been asked before, and I know there are functions to make this easy in Perl. But what I want is advice on my specific code. I want to go through each line of text which I've read from a file, and compare it to the same line from another file, printing them if they are different.
I've tried as many variations of this as I could think of, and none work. This specific code which I'm posting thinks every element in the array is different from the one in the other array.
use 5.18.2;
use strict;
use utf8;
printf "This program only compares two files.\n"
. "Here are the differences between "
. $ARGV[0] . " and " . $ARGV[1] . ":\n";
open FIRST_FH, '<', $ARGV[0];
chomp(my @file1 = <FIRST_FH>);
close FIRST_FH;
open SECOND_FH, '<', $ARGV[1];
chomp(my @file2 = <SECOND_FH>);
close SECOND_FH;
for(my $i=0; $i < scalar @file1; ++$i){
    my $string = $file2[$i];
    unless($_ =~ /$string/){
        print "Difference found: @file1[$i], @file2[$i]\n";
    }
}

use utf8; just instructs the interpreter to read your source file as UTF-8. Use the open pragma to set the default IO layers to UTF-8 (or manually specify '<:encoding(UTF-8)' as the second argument to open).
Don't use printf when print will suffice (it usually does, due to interpolation). In this particular instance, I find a heredoc to be most readable.
It's inefficient to read both files into memory. Iterate over them lazily by taking one line at a time in a while loop.
Always check if open failed and include $! in the error message. Alternatively, use autodie;, which handles this for you. Also, use lexical filehandles; they'll automatically close when they go out of scope, and won't clash with other barewords (e.g. subroutines and built-ins).
Keeping in mind these suggestions, the new code would look like:
#!/usr/bin/perl
use 5.18.2; # Implicitly loads strict
use warnings;
use open qw(:encoding(utf8) :std);
print <<"EOT";
This program only compares 2 files.
Here are the differences between
$ARGV[0] and $ARGV[1]:
EOT
open(my $file1, '<', shift) or die $!;
open(my $file2, '<', shift) or die $!;
while (my $f1_line = <$file1>, my $f2_line = <$file2>)
{
    if ($f1_line ne $f2_line)
    {
        print $f1_line, $f2_line;
    }
}
But this is still a naive algorithm; if one file has a line removed, all subsequent lines will differ between files. To properly achieve a diff-like comparison, you'll need an implementation of an algorithm that finds the longest common subsequence. Consider using the CPAN module Algorithm::Diff.
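For illustration, here is a minimal sketch using Algorithm::Diff (an assumption of how you might wire it up, not tested against your data; it assumes @file1 and @file2 hold the chomped lines of each file, as in the original script):
use Algorithm::Diff qw(diff);
# diff() returns a list of hunks; each hunk is a list of
# [sign, line number, text] changes, where sign is '+' for an
# added line and '-' for a removed one.
my @hunks = diff( \@file1, \@file2 );
for my $hunk (@hunks) {
    for my $change (@$hunk) {
        my ( $sign, $lineno, $text ) = @$change;
        print "$sign $lineno: $text\n";
    }
}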

Why are you comparing using $_, which you haven't defined anywhere?
my $string = $file2[$i];
unless($_ =~ /$string/){
Simply compare the lines using eq or ne:
if ( $file1[$i] ne $file2[$i] ) {
However, I would recommend that you make a lot of stylistic changes to your script, starting with doing line by line processing instead of slurping in the files. The following is how I would completely rewrite it:
use 5.18.2;
use strict;
use warnings;
use autodie;
use utf8;
my ( $file1, $file2 ) = @ARGV;
open my $fh1, '<', $file1;
open my $fh2, '<', $file2;
while ( !eof($fh1) && !eof($fh2) ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );
    if ( $line1 ne $line2 ) {
        warn "Difference found on line $.:\n $line1\n $line2\n";
    }
}
warn "Still more data in $file1\n" if !eof $fh1;
warn "Still more data in $file2\n" if !eof $fh2;

Related

How to load a CSV file into a perl hash and access each element

I have a CSV file with the following information separated by commas ...
Owner,Running,Passing,Failing,Model
D42,21,54,543,Yes
T43,54,76,75,No
Y65,76,43,765,Yes
I want to open this CSV file and place its contents into a Perl hash in my program. I am also interested in the code needed to print a specific element inside of the hash. For example, how I would print the "Passing" count for the "Owner" Y65.
The code I currently have:
$file = "path/to/file";
open $f, '<', $files, or die "cant open $file"
while (my $line = <$f>) {
#inside here I am trying to take the contents of this file and place it into a hash. I have tried numerous ways of trying this but none have seemed to work. I am leaving this blank because I do not want to bog down the visibility of my code for those who are kind enough to help and take a look. Thanks.
}
As well as placing the CSV file inside of a hash, I also need to understand the syntax to print and navigate through specific elements. Thank you very much in advance.
Here is an example of how to put the data into a hash %owners and later (after having read the file) extract a "passing count" for a particular owner. I am using the Text::CSV module to parse the lines of the file.
use feature qw(say);
use open qw(:std :utf8); # Assume UTF-8 files and terminal output
use strict;
use warnings qw(FATAL utf8);
use Text::CSV;
my $csv = Text::CSV->new ( )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
my $fn = 'test.csv';
open my $fh, "<", $fn
or die "Could not open file '$fn': $!";
my %owners;
my $header = $csv->getline( $fh ); # TODO: add error checking
while ( my $row = $csv->getline( $fh ) ) {
    next if @$row == 0; # TODO: more error checking
    my ($owner, @values) = @$row;
    $owners{$owner} = \@values;
}
close $fh;
my $key = 'Y65';
my $index = 1;
say "Passing count for $key = ", $owners{$key}->[$index];
Since it's not really clear what "load a CSV file into a perl hash" means (nor does it really make sense: an array of hashes, one per row, maybe, if you don't care about keeping the ordering of fields, but just a hash? What are the keys supposed to be?), let's focus on the rest of your question, in particular
how I will print the "Passing" count for the "Owner" Y65.
There are a few other CSV modules that might be of interest that are much easier to use than Text::CSV:
Tie::CSV_File lets you access a CSV file like a 2D array. $foo[0][0] is the first field of the first row of the tied file.
So:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use Tie::CSV_File;
my $csv = "data.csv";
tie my @data, "Tie::CSV_File", $csv or die "Unable to tie $csv!";
for my $row (@data) {
    say $row->[2] and last if $row->[0] eq "Y65";
}
DBD::CSV lets you treat a CSV file like a table in a database you can run SQL queries on.
So:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use DBI;
my $csv = "data.csv";
my $dbh = DBI->connect("dbi:CSV:", undef, undef,
{ csv_tables => { data => { f_file => $csv } } })
or die $DBI::errstr;
my $owner = "Y65";
my $p = $dbh->selectrow_arrayref("SELECT Passing FROM data WHERE Owner = ?",
{}, $owner);
say $p->[0] if defined $p;
Text::AutoCSV has a bunch of handy functions for working with CSV files.
So:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use Text::AutoCSV;
my $csv = "data.csv";
my $acsv = Text::AutoCSV->new(in_file => $csv) or die "Unable to open $csv!";
my $row = $acsv->search_1hr("OWNER", "Y65");
say $row->{"PASSING"} if defined $row;
This last one is probably closest to what I think you think you want.

comparing two filename arrays for differences

Below is my attempt at loading all filenames in a text file into an array and comparing that array to filenames which are in a separate directory. I would like to identify the filenames that are in the directory and not in the file so I can then process those files. I am able to load the contents of both sources successfully, but the compare operation is outputting all the files, not just the difference.
Thank you in advance for the assistance.
use File::Copy;
use Net::SMTP;
use POSIX;
use constant DATETIME => strftime("%Y%m%d", localtime);
use Array::Utils qw(:all);
use strict;
use warnings;
my $currentdate = DATETIME;
my $count;
my $ErrorMsg = "";
my $MailMsg = "";
my $MstrTransferLogFile = ">>//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Mstr_Transfer_Log.txt";
my $DailyLogFile = ">//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Daily_Transfer_Log_" . DATETIME . ".txt";
my $InputDir = "//CFVFTP/Users/ssi/Transfer_Logs/folder1/";
my $MoveDir = "//CFVFTP/Users/ssi/Transfer_Logs/folder2/";
my $filetouse;
my #filetouse;
my $diff;
my $file1;
my $file2;
my %diff;
open (MSTRTRANSFERLOGFILE, $MstrTransferLogFile) or $ErrorMsg = $ErrorMsg . "ERROR: Could not open master transfer log file!\n";
open (DAILYLOGFILE, $DailyLogFile) or $ErrorMsg = $ErrorMsg . "ERROR: Could not open daily log file!\n";
#insert all files in master transfer log into array for cross reference
open (FH, "<//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Mstr_Transfer_Log.txt") or $ErrorMsg = $ErrorMsg . "ERROR: Could not open master log file!\n";
my @master = <FH>;
close FH;
print "filenames in text file:\n";
foreach $file1 (@master) { print "$file1\n"; }
print "\n";
#insert all 835 files in Input directory into array for cross reference
opendir (DIR, $InputDir) or $ErrorMsg = $ErrorMsg . "ERROR: Could not open input directory $InputDir!\n";
my @list = grep { $_ ne '.' && $_ ne '..' && /\.835$/ } readdir DIR;
closedir(DIR);
print "filenames in folder\n";
foreach $file2 (@list) { print "$file2\n"; }
print "\n";
#get all the files in the Input directory that are NOT in the master transfer log and place into @filetouse array
@diff{ @master } = ();
@filetouse = grep !exists($diff{$_}), @list;
print "difference:\n";
foreach my $file3 (@filetouse) { print "$file3\n"; }
print DAILYLOGFILE "$ErrorMsg\n";
print DAILYLOGFILE "$MailMsg\n";
close(MSTRTRANSFERLOGFILE);
close(DAILYLOGFILE);
this is what the output looks like:
filenames in text file:
160411h00448car0007.835
filenames in folder
160411h00448car0007.835
160411h00448car0008.835
160418h00001com0001.835
difference:
160411h00448car0007.835
160411h00448car0008.835
160418h00001com0001.835
This should help you to do what you need. It stores the names of all of the files in INPUT_DIR as keys in hash %files, and then deletes all the names found in LOG_FILE. The remainder are printed.
This program uses autodie so that the success of IO operations needn't be checked explicitly. It was first available in Perl 5 core in v5.10.1.
use strict;
use warnings 'all';
use v5.10.1;
use autodie;
use feature 'say';
use constant LOG_FILE => '//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Mstr_Transfer_Log.txt';
use constant INPUT_DIR => '//CFVFTP/Users/ssi/Transfer_Logs/folder1/';
chdir INPUT_DIR;
my %files = do {
    opendir my $dh, '.';
    my @files = grep -f, readdir $dh;
    map { $_ => 1 } @files;
};
my @logged_files = do {
    open my $fh, '<', LOG_FILE;
    <$fh>;
};
chomp @logged_files;
delete @files{@logged_files};
say for sort keys %files;
Update
After a lot of attrition I found this underneath your original code
use strict;
use warnings 'all';
use v5.10.1;
use autodie;
use feature 'say';
use Time::Piece 'localtime';
use constant DATETIME => localtime()->ymd('');
use constant XFR_LOG => '//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Mstr_Transfer_Log.txt';
use constant DAILY_LOG => '//CFVFTP/Users/ssi/Transfer_Logs/Artiva/ARTIVA_Daily_Transfer_Log_' . DATETIME . '.txt';
use constant INPUT_DIR => '//CFVFTP/Users/ssi/Transfer_Logs/folder1/';
use constant MOVE_DIR => '//CFVFTP/Users/ssi/Transfer_Logs/folder2/';
chdir INPUT_DIR;
my @master = do {
    open my $fh, '<', XFR_LOG;
    <$fh>;
};
chomp @master;
my @list = do {
    opendir my $dh, '.';
    grep -f, readdir $dh;
};
my %diff;
@diff{ @master } = ();
my @filetouse = grep { not exists $diff{$_} } @list;
As you can see, it's very similar to my solution. Here are some notes about your original
Always use lexical file handles. With open FH, ... the file handle is global and will never be closed unless you do it explicitly or until the program terminates. Instead, open my $fh, ... leaves perl to close the file handle at the end of the current block
Always use the three-parameter form of open, so that the open mode is separate from the file name, and never put an open mode as part of a file name. You opened the same file twice: once as $MstrTransferLogFile which begins with >> and once explicitly because you needed read access
It is very rare for a program to be able to recover from an IO operation error. Unless you are writing fail-safe software, a failure to open or read from a file or directory means the program won't be able to fulfill its purpose. That means there's little reason to accumulate a list of error messages -- the code should just die when it can't succeed
The output from readdir is very messy if you need to process directories, because it includes the pseudo-directories . and .. But if you only want files then a simple grep -f, readdir $dh will throw those out for you
The block form of grep is often more readable, and not is much more visible than !. So grep !exists($diff{$_}), @list is clearer as grep { not exists $diff{$_} } @list
Unless your code is really weird, comments usually just add more noise and confusion and obscure the structure. Make your code look like what it does, so you don't have to explain it
Oh, and don't throw in all the things you might need at the start "just in case". Write your code as if it was all there and the compiler will tell you what's missing
I hope that helps
First, use a hash to store your already-processed files. Then it's just a matter of checking if a file exists in the hash.
(I've changed some variable names to make the answer a bit clearer.)
foreach my $file (@dir_list) {
    push @to_process, $file unless ($already_processed{$file});
}
(Which could be a one-liner, but get it working in its most expanded form first.)
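In case it isn't obvious where %already_processed comes from: assuming @master holds the chomped lines of the transfer log, it can be built in one statement:
my %already_processed = map { $_ => 1 } @master;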
If you insist on using an array instead, this is much less efficient:
foreach my $file (@dir_list) {
    push @to_process, $file unless (grep (/^$file$/, @already_processed));
}
(Again could be a one-liner, but...)

confusing filehandle in perl

Have been playing with the following script but still couldn't understand the meaning behind the two different "kinds" of filehandle forms. Any insight will be hugely appreciated.
#!/usr/bin/perl
use warnings;
use strict;
open (FH, "example.txt") or die $!;
while (<FH>) {
    my @line = split (/\t/, $_);
    print "@line", "\n";
}
The output is as expected: the @line array contains elements from lines 1, 2, 3 ... of example.txt. As I was told that open (FH, "example.txt") is not as good as open (my $fh, '<', 'example.txt'), I changed it, but then confusion arose.
From what I found, $fh is a scalar and contains ALL the info in example.txt. When I assigned an array to $fh, the array stored each line of example.txt as a component of the array. However, when I tried to further split a component into "more components", I got the error/warning message "use of uninitialized value". Below is the actual script that shows the error/warning message.
open (my $fh, '<', 'example.txt') or die $!;
foreach ($fh) {
    my @line = <$fh>;
    my $count = 0;
    for $count (0..$#line) {
        my @line2 = split /\t/, $line[$count];
        print "@line2";
        print "$line2[0]";
    }
}
print "#line2" shows the expected output but print "$line2[0]" invokes the error/warning message. I thought if #line2 is a true array, $line2[0] should be okay. But why "uninitialized value" ??
Any help will be appreciated. Thank you very much.
Added -
the following is the "actual" script (I re-ran it and the warning was there)
#!/usr/bin/perl
use warnings;
use strict;
open (my $fh, '<', 'example.txt') or die $!;
foreach ($fh) {
    my @line = <$fh>;
    print "$line[1]";
    my $count = 0;
    for my $count (0..$#line) {
        my @line2 = split /\t/, $line[$count];
        print "@line2";
        #my $line2_count = $#line2;
        #print $line2_count;
        print "$line2[3]";
    }
}
The warning is still use of uninitialized value $line2[3] in string at filename.pl line 15, <$fh> line 3.
In your second example, you are reading the filehandle in a list context, which I think is the root of your problem.
my $line = <$fh>;
Reads one line from the filehandle.
my #lines = <$fh>;
Reads all the file.
Your former example, thanks to the
while (<FH>) {
is effectively doing the first case.
But in the second example, you are doing the second thing.
AFAIK, you should always use
while (<FH>) {
# use $_ to access the content
}
or better
while(my $single_line = <FH>) {
# use $single_line to access the content
}
because while reads line by line, whereas for first loads everything into memory and iterates over it afterwards.
Even though the read returns undef on EOF or error, the check for undef is added by the interpreter when not explicitly done.
So with while you can load multi-gigabyte log files without any issue and without wasting RAM, which you can't with for loops that require arrays to be iterated.
At least this is how I remember it from a Perl book that I read some years ago.
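Applied to your second script, a line-by-line version might look like this (a sketch, assuming the same tab-separated example.txt; the guard on the last print avoids the uninitialized-value warning on lines with fewer than four fields):
open my $fh, '<', 'example.txt' or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    my @line2 = split /\t/, $line;
    print "@line2";
    print "$line2[3]" if @line2 > 3;   # only print the field if it exists
}
close $fh;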

Perl: Replace strings in multiple files with array entry

I am looking for a simple way to replace strings in multiple text files. In the first file the string should be replaced with the first element of the array @arrayF; in the second file the string must be replaced with the second entry, etc.
I want to replace ;size=\d+ where \d+ is a wildcard for any number.
This is what I have so far:
#!/usr/bin/perl -w
use strict;
use warnings;
my $counter = 0;
my @arrayF = '/Users/majuss/Desktop/filelist.txt>'; # Reads all lines into array
my @files = '/Users/majuss/Desktop/New_Folder/*'; #get Files into an array
foreach my $file ( @files ) {
    $file =~ s/;size=\d+/$arrayF[$counter]/g; #subst.
    print;
    $counter++; #increment array index
}
It gives a zero back and nothing happens.
I know how to do it in a one-liner but I can't figure a way out how to implement an array there.
Note these points that I commented on beneath your question
The line commented Reads all lines into array doesn't do that. It simply sets @arrayF to a one-element list that holds the string /Users/majuss/Desktop/filelist.txt>. You probably need to open the file and read its contents into an array
The line commented get Files into an array doesn't do that. It simply sets @files to a one-element list that holds the string /Users/majuss/Desktop/New_Folder/*. You probably need to use glob to expand the wildcard into a list of files
The statement
$file =~ s/;size=\d+/$arrayF[$counter]/g
is attempting to modify the variable $file which contains the name of the file. Presumably you meant to edit the contents of that file, so you must open and read it first
Please don't use upper-case letters in your local identifiers
Don't use -w on the shebang line as well as use warnings; just the latter is correct
This seems to do what you're asking for, but be aware that it is untested except that I have checked that it will compile. Be careful that you have a backup of the original files as this code will overwrite the original files with the modified data
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;
my $replacement_text = '/Users/majuss/Desktop/filelist.txt';
my $file_glob = '/Users/majuss/Desktop/New_Folder/*';
my @replacement_text = do {
    open my $fh, '<', $replacement_text;
    <$fh>;
};
chomp @replacement_text;
my $i = 0;
for my $file ( glob $file_glob ) {
    my $contents = do {
        open my $in_fh, '<', $file;
        local $/;
        <$in_fh>;
    };
    $contents =~ s/;size=\d+/$replacement_text[$i]/g;
    open my $out_fh, '>', $file;
    print $out_fh $contents;
    ++$i;
}
You're not opening filelist.txt and reading it.
To do this you need to:
open ( my $input, "<", '/Users/majuss/Desktop/filelist.txt' ) or die $!;
my @arrayF = <$input>;
close ( $input );
You need to use glob to search a directory pattern like that.
Like this:
foreach my $file ( glob ( '/Users/majuss/Desktop/New_Folder/*' ) ) {
# stuff
}
To search and replace within a file, it's actually a bit different from a one-liner. You can look at 'in place editing' in perlrun - but this is where perl is trying to pretend to be sed. I think you can do it if you try - there's an option in perlvar:
$^I
The current value of the inplace-edit extension. Use undef to disable inplace editing.
Mnemonic: value of -i switch.
This answer may give some insight:
In-place editing of multiple files in a directory using Perl's diamond and in-place edit operator
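For reference, a rough sketch of the $^I approach (untested; REPLACEMENT is a placeholder here, since applying a different @arrayF element per file would need extra logic keyed on $ARGV):
$^I   = '.bak';   # enable in-place editing, keeping .bak backups
@ARGV = glob '/Users/majuss/Desktop/New_Folder/*';
while (<>) {
    s/;size=\d+/REPLACEMENT/g;
    print;        # with $^I set, print writes back to the edited file
}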
Instead you can:
foreach my $file ( glob ( '/Users/majuss/Desktop/New_Folder/*' ) ) {
    open ( my $input_fh, "<", $file ) or die $!;
    open ( my $output_fh, ">", "$file.NEW" ) or die $!;
    my $replace = shift ( @arrayF );
    while ( my $line = <$input_fh> ) {
        $line =~ s/;size=\d+/$replace/g;
        print {$output_fh} $line;
    }
    close ( $input_fh );
    close ( $output_fh );
    #rename 'output'.
}
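The rename hinted at by that last comment might look like this (note it overwrites the original file, so keep backups):
rename "$file.NEW", $file or die "Cannot rename $file.NEW: $!";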

Perl - Open large txt file on server and create / save into smaller files of 100 lines each

I am trying to do this:
I FTP a large file of single words (~144,000, one word per line).
I need to open the uploaded file and create files with 100 lines max, one
word per line (01.txt, 02.txt etc).
I would like the processed 100 lines to be REMOVED from the original file
AFTER the file of 100 is created.
The server is shared but, I can install modules if needed.
Now, my code below is very crude as my knowledge is VERY limited. One problem is opening the whole file into an array: the shared server does not have enough memory, I assume, to open such a large file and read it into memory all at once? I just want the first 100 lines. Below is just opening a file that is small enough to be loaded and getting 100 lines into an array. Nothing else. I typed it quickly so it probably has several issues, but it shows my limited knowledge and need for help.
use vars qw($Word @Words $IN);
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
my $cnt = '0';
open $IN, '<', "$PathToFile" or die $!;
while (<$IN>) {
    chomp;
    $Word = $_;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    if ($cnt <= 99) {
        push(@Words, $Word);
    }
    $cnt++;
}
close $IN;
close $IN;
Thanks so much.
Okay, I am trying to implement the code below:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl' ? '/home/username/perl' : ( getpwuid($>) )[7] . '/perl');
    unshift @INC, $b__dir . '5/lib/perl5', $b__dir . '5/lib/perl5/x86_64-linux', map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print CGI::header();
my $WORD_LIST='/home/username/public/wordlists/Big-File-Of-Words.txt';
sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
print 'Done';
1;
But I get:
syntax error at split-up-big-file.pl line 12, near "sed 's/ *//g'"
Can't find string terminator "'" anywhere before EOF at split-up-big-file.pl line 12.
FINALLY:
Well I figured out a quick solution that works. Not pretty:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl' ? '/home/username/perl' : ( getpwuid($>) )[7] . '/perl');
    unshift @INC, $b__dir . '5/lib/perl5', $b__dir . '5/lib/perl5/x86_64-linux', map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use diagnostics;
print CGI::header();
my $sourcefile = '/home/username/public_html/test/bigfile.txt';
my $rowlimit = 100;
my $cnt= '1';
open(IN, $sourcefile) or die "Failed to open $sourcefile";
my $outrecno = 1;
while(<IN>) {
    if($outrecno == 1) {
        my $filename = $cnt.'.txt';
        open OUT, ">$filename" or die "Failed to create $filename";
        $cnt++;
    }
    print OUT $_;
    if($outrecno++ == $rowlimit) {
        $outrecno = 1;
        close OUT;
    }
}
close OUT;
I found enough info here to get me going. Thanks...
Here is a solution based on a slight modification of your code that should work approximately the way you want it.
It loops through all the lines of the input file and for every 100th line it will write the word list of the words encountered since the last write (or the beginning). The eof($IN) check is to catch the remaining lines if they are less than 100.
use strict;
use warnings;
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
open my $IN, '<', "$PathToFile" or die $!;
my $cnt = 0;
my $cnt_file = 0;
my @Words;
while ( my $Word = <$IN> ) {
    chomp $Word;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    push(@Words, $Word);
    if ( !(++$cnt % 100) || eof($IN) ) {
        $cnt_file++;
        open my $out_100, '>', "file_$cnt_file.txt" or die $!;
        print $out_100 join("\n", @Words), "\n";
        close $out_100;
        @Words = ();
    }
}
There's a non-Perl solution that you might find interesting...
$ split -l 100 -a6 /home/username/public/wordlists/Big-File-Of-Words.txt words.
This will split your big file of words into a bunch of files with no more than 100 lines each. The file name will start with words., and the suffix will range from aaaaaa to zzzzzz. Thus, you'll have words.aaaaaa, words.aaaaab, words.aaaaac, etc. You can then recombine all of these files back into your word list like this:
$ cat words.* > reconstituted_word_list.txt
Of course, you want to eliminate spaces, and lowercase the words all at the same time:
$ WORD_LIST=/home/username/public/wordlists/Big-File-Of-Words.txt
$ sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
The tr is the transformation command, and will change all uppercase to lower case. The split splits the files, and sed removes the spaces.
One of Unix's big strengths was its file handling ability. Splitting up big files into smaller pieces and reconstituting them was a common task. Maybe you had a big file, but a bunch of floppy disks that couldn't hold more than 100K per floppy. Maybe you were trying to use UUCP to copy these files over to another computer and there was a 10K limit on file transfer sizes. Maybe you were doing FTP by email, and the system couldn't handle files larger than 5K.
Anyway, I brought it up because it's probably an easier solution in your case than writing a Perl script. I am a big writer of Perl, and many times Perl can handle a task better and faster than shell scripts can. However, in this case, this is an easy task to handle in shell.
Here's a pure Perl solution. The problem is that you want to create files after every 100 lines.
To solve this, I have two loops. One is an infinite loop, and the other loops 100 times. Before I enter the inner loop, I create a file for writing, and write one word per line. When that inner loop ends, I close the file, increment my $output_file_num and then open another file for output.
A few changes:
I use use warnings; and use strict (which is included when you specify that you want Perl version 5.12.0 or greater).
Don't use use vars;. This is obsolete. If you have to use package variables, declare the variable with our instead of my. When should you use package variables? If you have to ask that question, you probably don't need package variables. 99.999% of the time, simply use my to declare a variable.
I use constant to define your word file. This makes it easy to move the file when needed.
My s/../../ not only removes beginning and ending spaces, but also lowercases my word for me. The ^\s*(.*?)\s*$ matches the entire line, but captures the word sans spaces at the beginning and end of the word. The .*? is like .*, but is non-greedy. It will match the minimum possible (which in this case does not include spaces at the end of the word).
Note I define a label INPUT_WORD_LIST. I use this to force my inner last to exit the outer loop.
I take advantage of the fact that $output_word_list_fh is defined only in the loop. Once I leave the loop, the file is automatically closed for me since $output_word_list_fh is out of scope.
And the program:
#!/usr/bin/env perl
use 5.12.0;
use warnings;
use autodie;
use constant WORD_FILE => "/home/username/public/wordlists/Big-File-Of-Words.txt";
open my $input_word_list_fh, "<", WORD_FILE;
my $output_file_num = 0;
INPUT_WORD_LIST:
for (;;) {
    open my $output_word_list_fh, ">", sprintf "%05d.txt", $output_file_num;
    for my $line (1..100) {
        my $word;
        if ( not $word = <$input_word_list_fh> ) {
            last INPUT_WORD_LIST;
        }
        chomp $word;
        $word =~ s/^\s*(.*?)\s*$/\L$1\E/;
        say {$output_word_list_fh} "$word";
    }
    close $output_word_list_fh;
    $output_file_num += 1;
}
close $input_word_list_fh;
