confusing filehandle in perl - arrays

I have been playing with the following script but still can't understand the difference between the two "kinds" of filehandle forms. Any insight will be hugely appreciated.
#!/usr/bin/perl
use warnings;
use strict;
open (FH, "example.txt") or die $!;
while (<FH>) {
my @line = split (/\t/, $_);
print "@line","\n";
}
The output is as expected: the @line array contains the elements from lines 1, 2, 3, ... of example.txt. I was told that open (FH, "example.txt") is not as good as open (my $fh, '<', 'example.txt'), so I changed it, but then confusion arose.
From what I found, $fh is a scalar and gives access to ALL the contents of example.txt. When I read $fh into an array, the array stored each line of example.txt as an element. However, when I tried to further split an element into "more elements", I got the warning "use of uninitialized value". Below is the actual script that produces the warning.
open (my $fh, '<', 'example.txt') or die $!;
foreach ($fh) {
my @line = <$fh>;
my $count = 0;
for $count (0..$#line) {
my @line2 = split /\t/, $line[$count];
print "@line2";
print "$line2[0]";
}
}
print "#line2" shows the expected output but print "$line2[0]" invokes the error/warning message. I thought if #line2 is a true array, $line2[0] should be okay. But why "uninitialized value" ??
Any help will be appreciated. Thank you very much.
Added:
The following is the "actual" script (I re-ran it and the warning was still there):
#!/usr/bin/perl
use warnings;
use strict;
open (my $fh, '<', 'example.txt') or die $!;
foreach ($fh) {
my @line = <$fh>;
print "$line[1]";
my $count = 0;
for my $count (0..$#line) {
my @line2 = split /\t/, $line[$count];
print "@line2";
#my $line2_count = $#line2;
#print $line2_count;
print "$line2[3]";
}
}
The warning is still: Use of uninitialized value $line2[3] in string at filename.pl line 15, <$fh> line 3.

In your second example, you are reading the filehandle in a list context, which I think is the root of your problem.
my $line = <$fh>;
reads one line from the filehandle, whereas
my @lines = <$fh>;
reads the whole file.
Your former example, thanks to the
while (<FH>) {
is effectively doing the first case, but in the second example you are doing the second thing.
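For what it's worth, here is one way the second script could be rewritten along those lines, reading and splitting one line at a time (a minimal sketch; the variable names are just illustrative):
open my $fh, '<', 'example.txt' or die $!;
while (my $line = <$fh>) {          # scalar context: one line per iteration
    chomp $line;
    my @fields = split /\t/, $line; # this line's tab-separated columns
    print "@fields\n";
    print "$fields[0]\n";           # safe: column 0 of the current line
}
close $fh;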

AFAIK, you should always use
while (<FH>) {
# use $_ to access the content
}
or better
while(my $single_line = <FH>) {
# use $single_line to access the content
}
because while reads line by line, whereas for first loads everything into memory and only then iterates over it.
Even though readline returns undef on EOF or error, the check for undef is added by the interpreter when you don't write it explicitly.
So with while you can process multi-gigabyte log files without any issue and without wasting RAM, which you can't do with for loops, since they need the whole array built before it can be iterated.
At least this is how I remember it from a Perl book that I read some years ago.
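To make the implicit check concrete, the two loops below behave the same way; Perl adds the defined() test itself when the while condition is nothing but a readline (a small sketch):
open my $fh, '<', 'example.txt' or die $!;
while (my $single_line = <$fh>) {   # Perl silently wraps this in defined(...)
    print $single_line;
}
# is compiled as if you had written:
# while (defined(my $single_line = <$fh>)) { ... }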

Related

Iterate through a file multiple times, each time finding a regex and returning one line (perl)

I have one file with ~90k lines of text in 4 columns.
col1 col2 col3 value1
...
col1 col2 col3 value90000
A second file contains ~200 lines, each one corresponding to a value from column 4 of the larger file.
value1
value2
...
value200
I want to read in each value from the smaller file, find the corresponding line in the larger file, and return that line. I have written a perl script that places all the values from the small file into an array, then iterates through that array using each value as a regex to search through the larger file. After some debugging, I feel like I have it almost working, but my script only returns the line corresponding to the LAST element of the array.
Here is the code I have:
open my $fh1, '<', $file1 or die "Could not open $file1: $!";
my @array = <$fh1>;
close $fh1;
my $count = 0;
while ($count < scalar @array) {
my $value = $array[$count];
open my $fh2, '<', $file2 or die "Could not open $file2: $!";
while (<$fh2>) {
if ($_ =~ /$value/) {
my $line = $_;
print $line;
}
}
close $fh2;
$count++;
}
This returns only:
col1 col2 col3 value200
I can get it to print each value of the array, so I know it's iterating through properly, but it's not using each value to search the larger file as I intended. I can also plug any of the values from the array into the $value variable and return the appropriate line, so I know the lines are there. I suspect my bug may have to do with either:
newlines in the array elements, since all the elements have a newline except the last one. I've tried chomp but get the same result.
or
something to do with the way I'm handling the second file with opening/closing. I've tried moving or removing the close command and that either breaks the code or doesn't help.
You should only read the 90k-line file once, checking the fourth column of each line against the values from the other file as you go, instead of re-reading the whole large file once per line of the smaller one:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
my ($file1, $file2) = @ARGV;
# Read the file of strings to match against
open my $fh1, '<', $file1 or die "Could not open $file1: $!";
my %words = map { chomp; $_ => 1 } <$fh1>;
close $fh1;
# Process the data file in one pass
open my $fh2, '<', $file2 or die "Could not open $file2: $!";
while (my $line = <$fh2>) {
chomp $line;
# Only look at the fourth column
my @fields = split /\s+/, $line, 4;
say $line if exists $words{$fields[3]};
}
close $fh2;
Note this uses a straight string comparison (via hash key lookup) against the last column instead of regular-expression matching - your sample data looks like that's all that's needed. If you're using actual regular expressions, let me know and I'll update the answer.
Your code does look like it should work, just horribly inefficiently. In fact, after adjusting your sample data so that more than one line matches, it does print out multiple lines for me.
A slightly different approach to the problem:
use warnings;
use strict;
use feature 'say';
my $values = shift;
open my $fh1, '<', $values or die "Could not open $values";
my @lookup = <$fh1>;
close $fh1;
chomp @lookup;
my $re = join '|', map { '\b'.$_.'\b' } @lookup;
((split)[3]) =~ /$re/ && print while <>;
Run as script.pl value_file data_file
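The last line is dense; an equivalent expanded form of that loop (same logic, just spelled out) would be:
while (<>) {                      # read the data file line by line
    my $fourth = (split)[3];      # split $_ on whitespace, keep the fourth field
    print if $fourth =~ /$re/;    # print the original line on a match
}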

Tie::File - Get 5 lines from a file into an array while removing them from file

I want to open a file, push 5 lines into an array for later use (or whatever is left if there are fewer than 5) and remove those 5 lines from the file as well.
It does not matter whether I am removing (or pushing) from head or tail of file.
I have used Tie::File in the past and am willing to use it, but I cannot figure it out with or without the Tie module.
use Tie::File;
my $limit='5';
$DataFile='data.txt';
###open my $f, '<', $DataFile or die;
my @lines;
tie (@lines, 'Tie::File', $DataFile);
$#lines = $limit;
###while( <@lines> ) {
shift @lines if @lines <= $limit;
push (@lines, $_);
###}
print @lines;
untie @lines;
I also tried File::ReadBackwards from an example I found, but I cannot figure out how to get the array of 5.
my $pos = do {
my $fh = File::ReadBackwards->new($DataFile) or die $!;
#@lines = (<FILE>)[1..$limit];
#$fh->readline() for 1..$limit;
my $log_line = $fh->readline for 1..$limit;
print qq~ LogLine $log_line~;
$fh->tell()};
All that said, this came close, but no cigar. How do I get the 5 into an array?
use File::ReadBackwards;
my $num_lines = 5;
my $pos = do {
my $fh = File::ReadBackwards->new($DataFile) or die $!;
$fh->readline() for 1..$num_lines;
$fh->tell()};
truncate($DataFile, $pos) or die $!;
I will check each line in the array against a regex later on. They still need to be removed from the file either way.
If you extract the last five lines instead of the first five, then you can use truncate instead of writing the entire file. Furthermore, you can use File::ReadBackwards to get those five lines without reading the entire file. That makes the following solution insanely faster than Tie::File for large files (and it will use far less memory):
use File::ReadBackwards qw( );
my $num_lines = 5;
my $fh = File::ReadBackwards->new($DataFile)
or die("Can't open $DataFile: $!\n");
my @extracted_lines;
while (@extracted_lines < $num_lines) {
my $line = $fh->readline();
last if !defined $line; # stop if the file has fewer lines than requested
push @extracted_lines, $line;
}
truncate($fh->get_handle(), $fh->tell())
or die("Can't truncate $DataFile: $!\n");
This removes the first five lines of the data.txt file, stores them in another array and prints the removed lines on STDOUT:
use warnings;
use strict;
use Tie::File;
my $limit = 5;
my $DataFile = 'data.txt';
tie my @lines, 'Tie::File', $DataFile or die $!;
my @keeps = splice @lines, 0, $limit;
print "$_\n" for @keeps;
untie @lines;

Comparing two arrays in Perl

I know this has been asked before, and I know there are functions to make this easy in Perl. But what I want is advice on my specific code. I want to go through each line of text which I've read from a file, and compare it to the same line from another file, printing them if they are different.
I've tried as many variations of this as I could think of, and none work. This specific code which I'm posting thinks every element in the array is different from the one in the other array.
use 5.18.2;
use strict;
use utf8;
printf "This program only compares two files.\n"
. "Here are the differences between "
. $ARGV[0] . " and " . $ARGV[1] . ":\n";
open FIRST_FH, '<', $ARGV[0];
chomp(my @file1 = <FIRST_FH>);
close FIRST_FH;
open SECOND_FH, '<', $ARGV[1];
chomp(my @file2 = <SECOND_FH>);
close SECOND_FH;
for(my $i=0; $i < scalar @file1; ++$i){
my $string = $file2[$i];
unless($_ =~ /$string/){
print "Difference found: #file1[$i], #file2[$i]\n";
}
}
use utf8; just instructs the interpreter to read your source file as UTF-8. Use the open pragma to set the default IO layers to UTF-8 (or manually specify '<:encoding(UTF-8)' as the second argument to open).
Don't use printf when print will suffice (it usually does, due to interpolation). In this particular instance, I find a heredoc to be most readable.
It's inefficient to read both files into memory. Iterate over them lazily by taking one line at a time in a while loop.
Always check if open failed and include $! in the error message. Alternatively, use autodie;, which handles this for you. Also, use lexical filehandles; they'll automatically close when they go out of scope, and won't clash with other barewords (e.g. subroutines and built-ins).
Keeping in mind these suggestions, the new code would look like:
#!/usr/bin/perl
use 5.18.2; # Implicitly loads strict
use warnings;
use open qw(:encoding(utf8) :std);
print <<"EOT";
This program only compares 2 files.
Here are the differences between
$ARGV[0] and $ARGV[1]:
EOT
open(my $file1, '<', shift) or die $!;
open(my $file2, '<', shift) or die $!;
while (my $f1_line = <$file1>, my $f2_line = <$file2>)
{
if ($f1_line ne $f2_line)
{
print $f1_line, $f2_line;
}
}
But this is still a naive algorithm; if one file has a line removed, all subsequent lines will differ between files. To properly achieve a diff-like comparison, you'll need an implementation of an algorithm that finds the longest common subsequence. Consider using the CPAN module Algorithm::Diff.
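For reference, here is a minimal sketch of what that might look like with Algorithm::Diff's diff() function (the file names are placeholders, and slurping is acceptable here because diff() needs both sequences in memory anyway):
use strict;
use warnings;
use Algorithm::Diff qw(diff);

open my $fh1, '<', 'old.txt' or die $!;   # placeholder names
open my $fh2, '<', 'new.txt' or die $!;
chomp( my @seq1 = <$fh1> );
chomp( my @seq2 = <$fh2> );

# diff() returns a list of hunks; each change is [ '+' or '-', line index, text ]
for my $hunk ( diff( \@seq1, \@seq2 ) ) {
    for my $change (@$hunk) {
        my ( $sign, $idx, $text ) = @$change;
        print "$sign line ", $idx + 1, ": $text\n";
    }
}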
Why are you comparing using $_, which you haven't defined anywhere?
my $string = $file2[$i];
unless($_ =~ /$string/){
Simply compare the lines using eq or ne:
if ( $file1[$i] ne $file2[$i] ) {
However, I would recommend that you make a lot of stylistic changes to your script, starting with doing line by line processing instead of slurping in the files. The following is how I would completely rewrite it:
use 5.18.2;
use strict;
use warnings;
use autodie;
use utf8;
my ( $file1, $file2 ) = @ARGV;
open my $fh1, '<', $file1;
open my $fh2, '<', $file2;
while ( !eof($fh1) && !eof($fh2) ) {
chomp( my $line1 = <$fh1> );
chomp( my $line2 = <$fh2> );
if ( $line1 ne $line2 ) {
warn "Difference found on line $.:\n $line1\n $line2\n";
}
}
warn "Still more data in $file1\n" if !eof $fh1;
warn "Still more data in $file2\n" if !eof $fh2;

Popping keys of an array to calculate a total

I'm trying to simply pop off each numeric value and add them together to gain a total.
Input file:
Samsung 46
RIM 16
Apple 87
Microsoft 30
My code compiles; however, it only returns 0:
open (UNITS, 'units.txt') || die "Can't open it $!";
my @lines = <UNITS>;
my $total = 0;
while (<UNITS>) {
chomp;
my $line = pop @lines;
$line += $total;
}
print $total;
No need to slurp all lines into an array if you're just going to loop through them anyway with a while. Also, you need to split each line to get your numbers.
use warnings;
use strict;
open (UNITS, 'units.txt') || die "Can't open it $!";
my $total = 0;
while (<UNITS>) {
chomp;
my $num = (split)[1];
$total += $num;
}
print "$total\n";
__END__
179
There are three problems here:
You are trying to add strings like 'Samsung 46' + 'RIM 16'.
You read the entire file into @lines and then try to read more from the file in the while loop. That loop is never entered because you have already read to the end of the file.
You are adding $total to $line within the loop, instead of the other way around, so $total remains at zero and $line keeps having zero added to it.
It is best to use while to read files unless you need something other than sequential access to the records, so removing @lines is a start.
It isn't completely clear which part of the records you want to accumulate. This program splits the lines on whitespace and adds together the last field of each line.
You must always use strict and use warnings at the start of every program. It is a measure that will make it far easier to locate bugs in your code. It is also best to use lexical file handles rather than the global one you used, and the three-parameter form of open.
use strict;
use warnings;
open my $units, '<', 'units.txt' or die "Can't open it: $!";
my $total;
while (<$units>) {
my @fields = split;
$total += $fields[-1];
}
print $total;
output
179
use strict;
use warnings;
open my $fh, "<", "units.txt" or die "well...";
my $total = 0;
while(<$fh>){
chomp;
my ($string, $num) = split(" ", $_);
$total += $num;
}
print $total;
This problem is a doddle with a one-liner:
$ perl -ane '$sum += $F[1] }{ print $sum' units.txt
Explanation
-a enables autosplit; each line is split and stored in @F
-n loops over the file line by line
-e tells perl that the next argument is to be treated as Perl code
the LHS of the Eskimo-kiss (that funny-looking }{ in the middle) is performed for every line in the input file, RHS performed only once
LHS accumulates the second column of every line in $sum
RHS prints the result of $sum once all lines have been processed
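Roughly, this is what the -a and -n switches expand that one-liner into, which is why the }{ trick works (a sketch of the equivalent full script):
my $sum = 0;
while (<>) {                # -n: loop over the input line by line
    my @F = split ' ', $_;  # -a: autosplit each line into @F
    $sum += $F[1];          # left-hand side of }{ : runs once per line
}
print $sum;                 # right-hand side of }{ : runs once, at the end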

perl - cutting many strings with given array of numbers

Dear fellow Perl masters of the world~!
I need your help.
I have a string file A and a number file B like this:
File A:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
...and so on, up to 200 lines.
File B:
3, 6, 2, 5, 6, 1, ... 2
(total 200 numbers in an array)
Then, with the numbers in File B, I would like to cut each string from the start position to the number of characters in File B.
E.g. as File B starts with 3, 6, 2 ...
File A will be
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
like this.
So, this is my code so far...
use strict;
if (@ARGV != 2) {
print "Invalid usage\n";
print "Usage: perl program.pl [num_list] [string_file]\n";
exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
my $i;
open(LIST,$number);
open(DATA,$strings);
my @list = <LIST>;
my $list_size = scalar @sp_list;
for ($i=0;$i<=$list_size;$i++) {
print $i,"\n";
#while (my $line = <DATA>) {
}
close(LIST);
close(DATA);
As the strings and numbers are 200 each, I changed the array into a scalar value to work on every number of every string.
I'm working on this, and I know I am supposed to use the pos function, but I do not know how to match each number with each string. Should I read the strings first with while, or use for so that I know how many times I have to run this to achieve the result?
Your help will be much appreciated!
Thank you.
I will be working on it, too. Need your feedback.
It is good that you use strict, and you should also use warnings. Further things to note:
You should check the return value of your open calls to make sure they did not fail. You should also use the three-argument form of open and a lexical file handle, especially when handling command-line arguments, which pose a security risk.
open my $listfh, "<", $file or die $!;
You may wish to use a safety precaution
use ARGV::readonly;
You can easily make the list of numbers with a map statement. Assuming the numbers are in a comma separated list:
my @list = map split(/\s*,\s*/), <$listfh>;
This will split the input line(s) on comma and strip excess whitespace.
When reading your input file, you do not need to use a counter variable. You can simply do
open my $inputfh, "<", $file or die $!;
while (<$inputfh>) {
my $length = shift @list; # these are your numbers
chomp; # remove newline
my $string = substr($_, 0, -$length); # negative length on substr
print "$string\n";
}
The negative length on substr makes it leave that many characters off the end of the string.
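For example (the string here is just an illustration):
my $string = "ABCDEFG";
print substr($string, 0, -3), "\n"; # prints "ABCD": everything except the last 3 characters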
Here is a one-liner in action that demonstrates these principles:
perl -lwe '$f = pop; # save file name for later
@nums = map split(/\s*,\s*/), <>; # process first file
push @ARGV, $f; # put back file name
while (<>) {
my $len = shift @nums;
chomp;
print substr($_,0,-$len);
}' fileb.txt filea.txt
Output:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
Note the use of the implicit open of file name arguments by manipulating @ARGV, and the handling of newlines with the -l switch.
Here is my suggestion. It uses autodie so that there is no need to explicitly check the status of open calls, and temporarily undefines $/ - the input record separator - so that all of the num_list file is read in one go. You aren't clear on whether this file will always contain just a single line; if it does, you can omit local $/.
The numbers are extracted from the text using a regular expression: in list context, /\d+/g returns all the strings of digits in the input as a list.
The second parameter to substr is the start position of the substring you want, and using a negative number counts from the end of the string instead of the beginning. The third parameter is the number of characters in the substring, and the fourth is a string to replace that substring in the target variable. So substr $data, -$n, $n, '' replaces the substring of length $n starting $n characters from the end with an empty string - i.e. it deletes it.
Note that if it is your intention to remove the given number of characters from the beginning of the string, then you would write substr $data, 0, $n, '' instead.
use strict;
use warnings;
use autodie;
unless (@ARGV == 2) {
print "Usage: perl program.pl [num_list] [string_file]\n";
exit;
}
my @numbers;
{
open my $listfh, '<', $ARGV[0];
local $/;
my $numbers = <$listfh>;
@numbers = $numbers =~ /\d+/g;
};
open my $datafh, '<', $ARGV[1];
for my $i (0 .. $#numbers) {
print "$i\n";
my $n = $numbers[$i];
my $data = <$datafh>;
chomp $data;
substr $data, -$n, $n, '';
print "$data\n";
}
Here is how I would do it. substr is the function to remove a part of a string. From your example, it is not clear whether you want to remove the characters at the beginning or at the end. Both alternatives are shown here:
#!/usr/bin/perl
use warnings;
use strict;
if (@ARGV != 2) {
die "Invalid usage\n"
. "Usage: perl program.pl [num_list] [string_file]\n";
}
my ($number_f, $string_f) = @ARGV;
open my $LIST, '<', $number_f or die "Cannot open $number_f: $!";
my @numbers = split /, */, <$LIST>;
close $LIST;
open my $DATA, '<', $string_f or die "Cannot open $string_f: $!";
while (my $string = <$DATA>) {
substr $string, 0, shift @numbers, q(); # Replace the first n characters with an empty string.
# To remove the trailing portion, replace the previous line with the following:
# my $n = shift @numbers;
# substr $string, -$n-1, $n, q();
print $string;
}
You were not checking the return value of open. Try to remember to always do that.
Do not declare variables far before you are going to use them ($i here).
Do not use C-style for loops if you do not have to. They are prone to fence post errors.
You can use substr():
use strict;
use warnings;
if (@ARGV != 2) {
print "Invalid usage\n";
print "Usage: perl program.pl [num_list] [string_file]\n";
exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
open my $list, '<', $numbers or die "Can't open $numbers: $!";
open my $data, '<', $strings or die "Can't open $strings: $!";
chomp(my $numlist = <$list>);
my @numbers = split /\s*,\s*/,$numlist;
for my $chop_length (@numbers)
{
my $data = <$data> // die "not enough data in $strings";
chomp $data;
print substr($data,0,length($data)-$chop_length)."\n";
}
Your specs say you want "... to cut each string from the start position to the number of characters in File B." I agree with choroba that it's not perfectly clear whether characters from the start or the end of the string are to be cut. However, I tend to think that you want to remove characters from the beginning when you say, "... from the start position ...", but a string like ABCDEFGHIJKLMNOPQRSTUVWXYZ012345 would help clarify this issue.
This option is not as self-documenting as the other solutions, but a discussion of it follows:
use strict;
use warnings;
@ARGV == 2 or die "Usage: perl program.pl [num_list] [string_file]\n";
open my $fh, '<', pop or die "Cannot open string file: $!";
chomp( my @str = <$fh> );
local $/ = ', ';
while (<>) {
chomp;
print +( substr $str[ $. - 1 ], $_ ) . "\n";
}
Strings:
ABCDEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Numbers:
3, 6, 2, 5, 6
Output:
DEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
The strings' file name is popped off @ARGV (since an explicit argument for pop is not used) and passed to open to read the strings into @str. The record separator is set to ', ' so chomp leaves only the number. The current line number in $. is used as part of the index to the corresponding @str element, and the remaining characters of the string, from position n on, are printed.
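Assuming the same layout as in the question, it would be run as perl script.pl num_list string_file; pop takes the last argument (the strings file), and the remaining file name is left in @ARGV for the <> loop to read the numbers from.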
