perl: More concise way to branch based on whether split succeeded? - arrays

I know split returns the number of fields parsed if it is assigned to a scalar, and returns a list if assigned to an array.
Is there a way to check whether a line is successfully parsed without having to call split twice (once to check how many fields were parsed, and, if the correct number of fields were parsed, a second time to return the fields in an array)?
foreach (@lines) {
    if ( split ) {
        my ($ipaddr, $hostname) = split;
    }
}
I need to check whether the split succeeded in order to avoid later uninitialized references to $ipaddr and $hostname. It just seems like I ought to be able to combine the two calls to split into a single call.

Sure:
foreach (@lines) {
    if (2 == (my ($ipaddr, $hostname) = split)) {
        # Got exactly two fields
    }
}
So if you just want to skip bad lines, you can simply use:
foreach (@lines) {
    2 == (my ($ipaddr, $hostname) = split)
        or next;
    # Got exactly two fields
}
Don't forget to remove trailing whitespace from your lines first (such as by using chomp to remove line feeds) or it will mess up your field count.
You can change the == to <= if there might be more fields.
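As a minimal sketch of the count check, the list assignment can be wrapped in a small helper. The name parse_host_line and the sample lines are hypothetical and not from the question; trailing whitespace is stripped first, as advised above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns ($ipaddr, $hostname)
# when the line has exactly two whitespace-separated fields, else an empty list.
sub parse_host_line {
    my ($line) = @_;
    $line =~ s/\s+\z//;    # strip trailing whitespace, including the newline

    # A list assignment in scalar context yields the number of values
    # produced on the right-hand side, i.e. the number of fields split returned.
    2 == (my ($ipaddr, $hostname) = split ' ', $line)
        or return;
    return ($ipaddr, $hostname);
}

# Made-up sample input
for my $line ("10.0.0.1 alpha\n", "malformed\n", "10.0.0.2 beta extra\n") {
    if (my ($ip, $host) = parse_host_line($line)) {
        print "$ip => $host\n";
    }
}
```

With this sample input only the first line has exactly two fields, so only that line is printed.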

I think I would prefer a regex match:
for ( @lines ) {
    next unless my ($ipaddr, $hostname) = /(\S+)\s+(\S+)/;
    # use $ipaddr & $hostname
}
This is different from the original in that it will succeed if more than two non-space substrings are found, but a fix is simple if it is necessary.
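One possible form of that fix, sketched under the assumption that a line should contain exactly two fields and nothing else, is to anchor the pattern at both ends. The helper name match_two_fields and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: captures two fields only when the line holds nothing else.
sub match_two_fields {
    my ($line) = @_;
    # \A and \z anchor the pattern to the whole string; in list context a
    # failed match returns an empty list.
    return $line =~ /\A\s*(\S+)\s+(\S+)\s*\z/;
}

# Made-up sample input
for my $line ("10.0.0.1 alpha\n", "10.0.0.2 beta extra\n") {
    next unless my ($ipaddr, $hostname) = match_two_fields($line);
    print "$ipaddr => $hostname\n";
}
```

Because \s also matches the newline, no chomp is needed before matching in this sketch.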

Related

How do I make the STDIN in the Array stop? (PERL) [duplicate]

My @array will not stop taking in STDIN...
my @array = undef;
while (@array = undef){
    @array = <STDIN>;
    for (@array[x]=5){
        @array = defined;
    }
}
As clarified, this limits the STDIN input to five lines:
use warnings;
use strict;
use feature 'say';
my @input;
while (<STDIN>) {
    chomp;
    push @input, $_;
    last if @input == 5;
}
say for @input;
There are other things to comment on in the posted code. While a good deal of it is cleared up in detail in Dave Cross's answer, I'd like to address the business of context when reading from a filehandle.
The "diamond" operator <> is context-aware. From I/O Operators (perlop):
If a <FILEHANDLE> is used in a context that is looking for a list, a list comprising all input lines is returned, one line per list element. It's easy to grow to a rather large data space this way, so use with care.
In the usual while loop the <> is in the scalar context
while (my $line = <$fh>)
and it is the same with while (<$fh>) since it assigns to the $_ variable, a scalar, by default.
But if we assign to an array, say from a filehandle $fh with which a file was opened,
my @lines = <$fh>;
then the <> operator works in list context. It reads all lines until it sees EOF (end-of-file), at which point it returns all lines, which are assigned to @lines. Remember that each line keeps its newline. You can remove them all by
chomp @lines;
since chomp works on a list as well.
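The difference between the two contexts can be seen in a short sketch. It reads from an in-memory filehandle rather than a real file, and the sample text is made up for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";    # made-up sample data
open my $fh, '<', \$text or die "Cannot open in-memory filehandle: $!";

my $first = <$fh>;    # scalar context: reads a single line
my @rest  = <$fh>;    # list context: reads all remaining lines up to EOF
close $fh;

chomp $first;         # chomp on a scalar
chomp @rest;          # chomp on a whole list
print "first: $first\n";
print "rest: @rest\n";
```

This prints "first: one" and then "rest: two three", since the list-context read picks up where the scalar-context read stopped.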
With STDIN this raises an issue when input comes from keyboard, as <> waits for more input since EOF isn't coming on its own. It is usually given as Ctrl+D† on Unixy systems (Ctrl+Z on Windows).
So you can, in principle, have @array = <STDIN> and quit input with Ctrl+D, but this may be a little awkward for input expected from the keyboard, as it mostly implies the need for line-by-line processing. It is less unusual if STDIN comes from a file,
script.pl < input.txt
or a pipe on the command line
some command with output | script.pl
where we do get an EOF once the file or pipe is exhausted.
But I'd still stick to a customary while when reading STDIN, and process it line by line.
† The Ctrl+D is how this is usually referred to, but one actually types a lowercase d with Ctrl. Note that Ctrl and c (labeled as Ctrl+C) does something entirely different; it sends the SIGINT signal, which terminates the whole program if not caught.
my @array = undef;
while (@array = undef){
These two lines don't do what (I assume) you think they are doing.
my @array = undef;
This defines an array with a single element which is the special value undef. I suspect that what you actually wanted was:
my @array = ();
which creates an empty array. But Perl arrays are always empty when first created, so this can be simplified to:
my @array;
The second line repeats that error and adds a new one.
while (@array = undef) {
I suspect you want to check for an empty array here and you were reaching for something that meant something like "if @array is undefined". But you missed the fact that in Perl, assignment operators (like =) are different from comparison operators (like ==). So this line assigns undef to @array rather than comparing it. You really wanted @array == undef - but that's not right either.
You need to move away from this idea of checking that an array is "defined". What you're actually interested in is whether an array is empty. And Perl has a clever trick that helps you work that out.
If you use a Perl array in a place where Perl expects to see a single (scalar) value, it gives you the number of elements in the array. So you can write code like:
my $number_of_elements = @an_array;
The boolean logic check in an if or while condition is a single scalar value. So if you want to check if an array contains any elements, you can use code like this:
if (@array) {
    # @array contains data
} else {
    # @array is empty
}
And to loop while an array contains elements, you can simply write:
while (@array) {
    # do something
}
But here, you want to do something while your array is empty. To do that, you can either invert the while condition logic (using ! for "not"):
while (!@array) {
    # do something
}
Or you can switch to using an until test (which is the opposite of while):
until (@array) {
# do something
}
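To make the difference between an empty array and an array holding undef concrete, here is a small sketch. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my @with_undef = undef;    # one element, whose value is undef
my @empty;                 # no elements

# An array in scalar context gives its element count.
my $count_with_undef = @with_undef;
my $count_empty      = @empty;

print "with_undef has $count_with_undef element(s)\n";
print "empty has $count_empty element(s)\n";
print "empty is false in a condition\n" unless @empty;
```

The first count is 1 and the second is 0, which is why a test such as while (!@array) never fires for an array that was initialized with undef.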
I'm going to have to stop there. I hope this gives you some insight into what is wrong with your code. I'm afraid that this level of wrongness permeates the rest of your code too.

Dereferencing an array from an array of arrays in perl

I have various subroutines that give me arrays of arrays. I have tested them separately, but somehow when I write my main routine, I fail to make the program recognize my arrays. I know it's a problem of dereferencing, or at least I suspect it heavily.
The code is a bit long but I'll try to explain it:
my @leaderboard=@arrarraa; #an array of arrays
my $parentmass=$spect[$#spect]; #scalar
while (scalar @leaderboard>0) {
    for my $i(0..(scalar @leaderboard-1)) {
        my $curref=$leaderboard[$i]; #the program says here that there is an uninitialized value. But I start with a list of 18 elements.
        my @currentarray=@$curref; #then i try to dereference the array
        my $w=sumaarray (@currentarray);
        if ($w==$parentmass) {
            if (defined $Leader[0]) {
                my $sc1=score (@currentarray);
                my $sc2=score (@Leader);
                if ($sc1>$sc2) {
                    @Leader=@currentarray;
                }
            }
            else {@Leader=@currentarray;}
        }
        elsif ($w>$parentmass) {splice @leaderboard,$i,1;} #here i delete the element if it doesn't work. I hope it's done correctly.
    }
    my $leadref= cut (@leaderboard); #here i take the first 10 scores of the AoAs
    @leaderboard = @$leadref;
    my $leaderef=expand (@leaderboard); #then i expand the AoAs by one term
    @leaderboard= @$leaderef; #and i should end with a completely different list to work with in the while loop
}
So I don't know how to dereference the AoAs correctly. The output of the program says:
"Use of uninitialized value $curref in concatenation (.) or string at C:\Algorithms\22cyclic\cyclospectrumsub.pl line 183.
Can't use an undefined value as an ARRAY reference at C:\Algorithms\22cyclic\cyclospectrumsub.pl line 184."
I would appreciate enormously any insight or recommendation.
The problem is with the splice, which modifies the list while it is being processed. By using 0..(scalar @leaderboard-1) you set up the range of elements to process at the beginning, but when some elements are removed by the splice, the list ends up shorter than that, and once $i runs off the end of the modified list you get undefined references.
A quick fix would be to use
for (my $i = 0; $i < @leaderboard; $i++)
although that's neither very idiomatic nor efficient.
Note that doing something like $i < @leaderboard or @leaderboard-1 already provides scalar context for the array variable, so you don't need the scalar() call; it does nothing here.
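As an alternative that is not taken from the answers here, removal by index can be made safe by walking the indices from the end. A forward loop that splices at $i moves the next element into position $i, so incrementing $i skips it; going backwards avoids that. The data and the $limit threshold below are made up for illustration.

```perl
use strict;
use warnings;

my @items = (5, 12, 7, 20, 3);    # made-up data
my $limit = 10;                   # hypothetical threshold

# Walking the indices from the end means a removal only shifts elements
# that have already been visited.
for my $i (reverse 0 .. $#items) {
    splice @items, $i, 1 if $items[$i] > $limit;
}

print "@items\n";
```

This prints "5 7 3". When no index is needed, a grep over the list expresses the same filter more directly.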
I'd probably use something like
my @result;
while(my $elem = shift @leaderboard) {
    ...
    if ($w==$parentmass) {
        # do more stuff
        push @result, $elem;
    }
}
So instead of deleting from the original list, all elements would be taken off the original and only the successful (by whatever criterion) ones included in the result.
There seem to be two things going on here:
You're removing all arrays from @leaderboard whose sumaarray is greater than $parentmass
You're putting in @Leader the array with the highest score of all the arrays in @leaderboard whose sumaarray is equal to $parentmass
I'm unclear whether that's correct. You don't seem to handle the case where sumaarray is less than $parentmass at all. But that can be written very simply by using grep together with the max_by function from the List::UtilsBy module
use List::UtilsBy 'max_by';
my $parentmass = $spect[-1];
my @leaderboard = grep { sumaarray(@$_) <= $parentmass } @arrarraa;
my $leader = max_by { score(@$_) }
    grep { sumaarray(@$_) == $parentmass }
    @leaderboard;
I'm sure this could be made a lot neater if I understood the intention of your algorithm; especially how those elements with a sumaarray of less than $parentmass

Search and save a line containing a set of chars from an array

I'm using readlines() to read the data into my two arrays, and I need to get a specific line.
problem:
I'm not sure what string or array method to use to get the specific line containing '...blah' (my class' line/name of the group.)
I prefer to not save the entire file content into these two variables if possible. I'd rather just "search" through the file and save the two specific lines I need into student_info and class_usernames.
student_info = File.readlines ('/etc/passwd')
class_usernames = File.readlines ('/etc/group')
You could use detect on that array:
student_info = File.readlines('/etc/passwd').detect { |s| s.include?('...blah') }
class_usernames = File.readlines('/etc/group').detect { |s| s =~ /\.\.\.blah/ }
And btw please do not use a whitespace between the method name and the parameter list.
student_info = File.readlines('/etc/passwd').grep(/blah/).first
grep is also a method from Enumerable. It returns an array of matches, of which the first is taken (by .first).

Is there an array version of $1...$NF in awk?

Consider the following function which is currently in the public domain.
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}
I would like to use this function to join contiguous columns such as $2, $3, $4 where the start and end ranges are variables.
However, in order to do this, I must first convert all the fields into an array using a loop like the following.
for (i = 1; i <= NF; i++) {
    a[i] = $i
}
Or the shorter version, as @StevenPenny mentioned.
split($0, a)
Unfortunately both approaches require the creation of a new variable.
Does awk have a built-in way of accessing the columns as an array so that the above manual conversions are not necessary?
No such array is defined in POSIX awk (the only array type special variables are ARGV and ENVIRON).
None exists in gawk either, though it adds PROCINFO, SYMTAB and FUNCTAB special arrays. You can check all the defined variables and types at runtime using the SYMTAB array (gawk-4.1.0 feature):
BEGIN { PROCINFO["sorted_in"]="@ind_str_asc" } # automagic sort for "in"
{ print $0 }
END { for (ss in SYMTAB) printf("%-12s: %s\n",PROCINFO["identifiers"][ss],ss) }
(though you will find that SYMTAB and FUNCTAB themselves are missing from the list, and missing from --dump-variables too, they are treated specially by design).
gawk also offers a few standard loadable extensions, though none implements this feature (and given the dynamic relationship between $0, $1..., NF and OFS, an array that had the same functionality would be a little tricky to implement).
As suggested by Jidder, one work-around is to skip the array altogether and use fields. There's nothing special about the field names: a variable $n can be used the same as a literal like $1 (just take care to use parentheses for precedence in expressions like $(NF-1)). Here's an fjoin function which works on fields rather than an array:
function fjoin(start,end,sep,    result,ii) {
    if (sep=="") sep=" "
    else if (sep==SUBSEP) sep =""
    result=$start
    for (ii=start+1; ii<=end; ii++)
        result=result sep $ii
    return result
}
{ print "2,4: " fjoin(2,4,":") }
(this does not treat $0 as a special case)
Or just use split() and be happy, gawk at least guarantees that it behaves identically to field splitting (assuming that none of FS, FIELDWIDTHS and possibly IGNORECASE are being modified so as to change the behaviour).
This is what I do in my own code:
function iter0gen() {
    PROCINFO["sorted_in"] = "@ind_num_asc"; # skip this for mawk
    return split(sprintf("%0"(NF)"d", 0), iter0, //)
}
Since splitting by the null string puts one character in each element, splitting a string of zeros whose length equals NF creates an array called iter0, and then you can do
for (x in iter0) { $(x) = do stuff….. }
This is only for when you need a lazy iterator. The plus side is that since indices begin at 1 by default, you can't accidentally get $0 in the iterator loop. The downside is that, if you're not careful, you will have switched every input FS into OFS the moment you assign to any field, and this doesn't back up $0 beforehand on your behalf.
If you just want the columns, then just do a standard split() of $0 into an array using FS. If you're using gawk and would like the seps array too, then add the optional 4th argument, which is non-portable.

re-initializing awk array created by split

I'm trying to use split to reverse the order of characters in a string that appears as the second field in a file with many such lines. The command:
{
    n=split($2,arr," ");
    for(i=1;i<=n;i++)
        s=arr[i] s
}
{ print s }
does this for one line. However, the arr array (and n) seem immortal, so that when I embed this code into an awk script to process multiple lines, the output corresponding to the field I want reversed accumulates (and reverses) all previous lines:
1_B.pdb
GGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRG
AARS_0001_B.pdb
GGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRGGGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRG
AARS_0002_B.pdb
GLILYDGFLDKRDLEGLKYNDILNRTKDVTDVGNTTRTECPDVNRKGGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRGGGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRG
AARS_0003_B.pdb
DGCSLDGFTDDRDLKGALYNKILNKTLIVTDVGNTTRTEVCEKDRYGLILYDGFLDKRDLEGLKYNDILNRTKDVTDVGNTTRTECPDVNRKGGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRGGGTGYPGLKDKDDNEGTKYNKLLNATLIVTDVGNTIRTECPDVNRG
This appears to me to be a problem with re-initialization. I've tried to delete all previous elements of arr[] and to reset n to 0, without any effect. What do I need to do?
It's not arr that's immortal, it's s since you never [re-]init it to "" outside of the loop. arr is getting re-inited on every call to split().
Try this:
{
    n=split($2,arr,/ /)
    s=""
    for(i=1;i<=n;i++)
        s=arr[i] s
    print s
}
The 3rd arg for split(), by the way is a field separator, not a string, and a field separator is a regexp with a couple of extra properties so the correct way to call split with a fixed "string" is using RE delimiters split($2,arr,/ /), not string delimiters split($2,arr," "). It doesn't make a functional difference in this case but it does when the field separator gets more complicated so best to get used to doing it the right way.
Bonus round: you would not need to explicitly re-init s if you put that code in a function:
function rev(str,    arr,n,s,i) {
    n=split(str,arr,/ /)
    for(i=1;i<=n;i++)
        s=arr[i] s
    return s
}
...
{ print rev($2) }
Reason left as an exercise :-).
