Remove blank regex hits from an array

Remove blank regex hits from an array - arrays

I am performing a regex search and wishing to print out only the hits in fasta format (two lines of data: the first beginning with a carrot ">" followed by the hit and the second line without a carrot, but still containing the hit information).
I can successfully generate an output multifasta file, but the carrot and line breaks are included in the output file whether there is a hit or not.
Generated output:
>
>
>TAGCTAGC
TAGCTAGC
>
>GCTAGCTA
GCTAGCTA
Desired output:
>TAGCTAGC
TAGCTAGC
>GCTAGCTA
GCTAGCTA
Here is my code:
#!/usr/bin/perl
use warnings;
use strict;
open(CLUSTER, ">", "SequencesToCluster.txt") or die $!;
my #TrimmedSequences;
my #ArrayofFiles = glob ("~/BLASTdb/Individual_Sequences_*");
foreach my $file (#ArrayofFiles){
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>){
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
$seq =~ m/([TAGC]{16})(CGGAGCTTTA|GCCATTTCT|TAAAGCTCCG|AGAAATGGGC/;
push(#TrimmedSequences, ">", $1, "\n", $1, "\n");
}
}
}
#Here I believe I need to manipulate the array to get rid of blank fastas
print CLUSTER #TrimmedSequences;

If you're filtering an array, the tool is grep.
E.g.
my #new_array = grep { not /^\s*$/ } #old_array;
This will filter any element that is just whitespace. In your case, since it's empty or just a >:
/^>?\s*$/ instead.
However, that's fixing a problem that need not exist in the first place. You could instead:
$seq =~ m/([TAGC]{16})(CGGAGCTTTA|GCCATTTCT|TAAAGCTCCG|AGAAATGGGC)/
&& push(#TrimmedSequences, ">", $1, "\n", $1, "\n");
And that will only push if the regex matches.

Related

Perl - push lines inbetween regex into one element of array

This is the log file I am dealing with -
|
blah1a
blah1b
blah1c
|
****blahnothing1
|
blah2a
blah2b
blah2c
|
blahnothing2
|
blah3a
blah3b
blah3c
|
blahnothing3
The information that I need is nestled between two pipe characters. There are alot of lines with that start with asteriks, I skip over them. Each line has windows end of line characters. The data in between these pipe characters is contigious, but when read on a linux host, it is chopped up with the windows new lines.
I wrote the perl script with a range operator between the two lines hoping that everything that starts with a pipe delimiter would get pushed into an array element and then stop at the next pipe delimiter, then start again. Each array element would have all the lines in between the two pipes characters.
Ideally the arrays would look like this, sans the windows control characters.
$lines[0] blah1a blah1b blah1c
$lines[1] blah2a blah2b blah2c
$lines[2] blah3a blah3b blah3c
However each arrays do not look like that.
#!/usr/bin/perl
use strict ;
use warnings ;
my $delimiter = "|";
my $filename = $ARGV[0] ;
my #lines ;
open(my $fh, '<:encoding(UTF-8)' , $filename) or die "could not open file $filename $!";
while (my $line = readline $fh) {
next if ($line =~/^\*+/) ;
if ($line =~ /$delimiter/ ... $line =~/$delimiter/) {
push (#lines, $line) ;
}
}
print $lines[0] ;
print $lines[1] ;
print $lines[2] ;

This seems to satisfy your requirement
I've left the two lines blahnothing2 and blahnothing3 in place as I couldn't see a rationale for removing them
The \R regex pattern is the generic newline, and matches the newline sequences from any platform, i.e. CR, LF, or CRLF
use strict;
use warnings 'all';
my $data = do {
open my $fh, '<:raw', 'blah.txt' or die $!;
local $/;
<$fh>;
};
$data =~ s/^\s*\*.*\R/ /gm; # Remove lines starting with *
$data =~ s/\R/ /g; # Change all line endings to spaces
# Split on pipe and remove blank elements
my #data = grep /\S/, split /\s*\|\s*/, $data;
use Data::Dump;
dd \#data;
output
[
"blah1a blah1b blah1c",
"blah2a blah2b blah2c",
"blahnothing2",
"blah3a blah3b blah3c",
"blahnothing3 ",
]

It seems that you want to merge lines between |, into a string, which gets placed on an array.
One way is to set the | as input record separator, so read a chunk between pipes each time
{ # localize the change to $/
local $/ = "|";
open(my $fh, '<:encoding(UTF-8)' , $filename)
or die "could not open file $filename $!";
my #records;
while (my $section = <$fh>)
{
next if $section =~ /^\s*\*/;
chomp $section; # remove the record separator (| here)
$section =~ s/\R/ /g; # clean up newlines
$section =~ s/^\s*//; # clean up leading spaces
push #records, $section if $section;
}
print "$_\n" for #records;
}
I skip a "section" if it starts with * (and an optional space). There can be more restrictive versions. The $section can end up being an emtpy string, so we push it on the array conditionally.
Output, with the example in the question copy-pasted into the input file with $filename
blah1a blah1b blah1c
blah2a blah2b blah2c
blahnothing2
blah3a blah3b blah3c
blahnothing3
The approach in the question is fine, but you need to merge lines within a "section" (between pipes) and place each such string on the array. So you need a flag to track when enter/leave a section.

How to get the data of each line from a file?

Here, I want to print the data in each line as 3 separate values with ":" as separator. The file BatmanFile.txt has the following details:
Bruce:Batman:bat#bat.com
Santosh:Bhaskar:santosh#santosh.com
And the output I expected was:
Bruce
Batman
bat#bat.com
Santosh
Bhaskar
santosh#santosh.com
The output after executing the script was:
Bruce
Batman
bat#bat.com
Bruce
Batman
bat#bat.com
Please explain me what I am missing here:
use strict;
use warnings;
my $file = 'BatmanFile.txt';
open my $info, $file or die "Could not open $file: $!";
my #resultarray;
while( my $line = <$info>) {
#print $line;
chomp $line;
my #linearray = split(":", $line);
push(#resultarray, #linearray);
print join("\n",$resultarray[0]),"\n";
print join("\n",$resultarray[1]),"\n";
print join("\n",$resultarray[2]),"\n";
}
close $info;

You are looping through file line by line. You have stored all lines (after splitting) in an array. Once the loop finishes you have all data in resultarray array, just print whole array after the loop (instead of printing just first 3 indexes which are you doing at the moment).
#!/usr/bin/perl
use strict;
use warnings;
my #resultarray;
while( my $line = <DATA>){
chomp $line;
my #linearray = split(":", $line);
push #resultarray, #linearray;
}
print "$_\n" foreach #resultarray;
__DATA__
Bruce:Batman:bat#bat.com
Santosh:Bhaskar:santosh#santosh.com
Demo
You can avoid all variables and do something like below
while(<DATA>){
chomp;
print "$_\n" foreach split(":");
}
One liner:
perl -e 'while(<>){chomp; push #result, split(":",$_);} print "$_\n" foreach #result' testdata.txt

When you do:
push(#resultarray, #linearray);
you're pushing #linearray into #resultarray at the end, so index 0 through 2 is still the items from the first time you pushed #linearray.
To overwrite #resultarray with the values from the second iteration, do:
#resultarray = #linearray;
instead.
Alternatively, use unshift to place #linearray at the start of #resultarray, as suggested by Sobrique:
unshift #resultarray, #linearray;

So, you just want to transliterate : to \n?
$ perl -pe 'tr/:/\n/' data.txt
Output:
Bruce
Batman
bat#bat.com
Santosh
Bhaskar
santosh#santosh.com

use strict;
use warnings;
my $file = 'BatmanFile.txt';
open my $info, $file or die "Could not open $file: $!";
my #resultarray;
while( my $line = <$info>) {
#print $line;
chomp $line;
my #linearray = split(":", $line);
#push(#resultarray, #linearray);
print join("\n",$linearray[0]),"\n";
print join("\n",$linearray[1]),"\n";
print join("\n",$linearray[2]),"\n";
}
close $info;

Popping keys of an array to calculate a total

I'm trying to simply pop off each numeric value and add them together to gain a total.
Input file:
Samsung 46
RIM 16
Apple 87
Microsoft 30
My code compiles, however, it only returns 0:
open (UNITS, 'units.txt') || die "Can't open it $!";
my #lines = <UNITS>;
my $total = 0;
while (<UNITS>) {
chomp;
my $line = pop #lines;
$line += $total;
}
print $total;

No need to slurp all lines into an array if you're just going to loop through them anyway with a while. Also, you need to split each line to get your numbers.
use warnings;
use strict;
open (UNITS, 'units.txt') || die "Can't open it $!";
my $total = 0;
while (<UNITS>) {
chomp;
my $num = (split)[1];
$total += $num;
}
print "$total\n";
__END__
179

There are three problems here
You are trying to add strings like 'Samsung 46' + 'RIM 16'
You read the entire file into #lines and then try to read more from the file in the while loop. That loop is never entered because you have already read to end of file
You are adding $total to the (undeclared) variable $line within the loop, instead of the other way around. So $total remains at zero and $line keeps having zero added to it
It is best to use while to read files unless you need something other than sequential access to the records, so removing #lines is a start.
It isn't completely clear which part of the records you want to accumulate. This program splits the lines on whitespace and adds together the last field of each line.
You must always use strict and use warnings at the start of every program. It is a measure that will make it far easier to locate bugs in your code. It is also best to use lexical file handles rather than the global one you used, and the three-parameter form of open.
use strict;
use warnings;
open my $units, '<', 'units.txt' or die "Can't open it: $!";
my $total;
while (<$units>) {
my #fields = split;
$total += $fields[-1];
}
print $total;
output
179

use strict;
use warnings;
open my $fh, "<", "units.txt" or die "well...";
my $total = 0;
while(<$fh>){
chomp;
my ($string, $num) = split(" ", $_);
$total += $num;
}
print $total;

This problem is a doddle with a one-liner:
$ perl -ane '$sum += $F[1] }{ print $sum' units.txt
Explanation
-a enables autosplit, each line is split and stored in #F
-n loops over the file line by line
-e tells perl that the next argument is to be treated as Perl code
the LHS of the Eskimo-kiss (that funny-looking }{ in the middle) is performed for every line in the input file, RHS performed only once
LHS accumulates the second column of every line in $sum
RHS prints the result of $sum once all lines have been processed

printing arrays to text file (v2)

yesterday i asked this question and got many helpful replies, hoping the same will be true today. Here is my revised script.
#! /usr/bin/perl
use strict;
use warnings;
my $line;
my #array;
my $print;
open (OUT , ">","output.txt")or die "cant open: $!";
while ($line = <>){
chomp($line);
push(#array, $line);
if(#array == 250){
$print = print "[", join(",",#array), "]", "\n";
print OUT $print;
#array = []
}
}
Originally i simply stated that i need to print out the first 250 elements of an array to a text file, and that this array was built from standard input. What i didnt state is that the input, from which the array is built, may consist of several thousand lines. The reason for creating an array from this input is so that i can limit the size of the array to 250 entities, and then print the array as a formatted string. I then need to flush the array and resume building at what will become the 251st line of the input, and continue doing this process for the remainder of the input. Lets say the input is 5k lines, i want my output to be a text file containing the original 5k lines of input, but divided into strings made up of 250 array entities.
Currently the script is just printing the array to the screen and inside "ouput.txt" is a single line reading: 1111111.

$print = print "[", join(",",#array), "]", "\n";
print OUT $print;
Here you are assigning the return value of print to $print, which is 1, because the printing is successful. What you want to do is this:
print OUT "[", join(",",#array), "]", "\n";

Here's another option which just slightly modifies your script:
use strict;
use warnings;
my #array;
local $" = ',';
while (<>) {
chomp;
push #array, $_;
if ( #array == 250 ) {
print "[#array]\n";
undef #array;
}
}
Usage: perl script.pl Infile [>outFile]
The last, optional parameter directs output to a file.
The variable $" is holds Perl's list separator that's applied to an interpolated array, so join's not needed here.
Hope this helps!

#! /usr/bin/perl
use strict;
use warnings;
my $line;
my #array;
my $print;
open (OUT , ">","moloch_chunker_output.txt")or die "cant open: $!";
while ($line = <>){
chomp($line);
push(#array, $line);
if(#array == 250){
print OUT "[", join(",",#array), "]", "\n";
#array = ();
}
}
Thank you davs, that fixed it.

perl - cutting many strings with given array of numbers

dear my fellow perl masters in the world~!
I need your help.
I have a string file A and a number file B like this:
File A:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
...and so on till 200.
File B:
3, 6, 2, 5, 6, 1, ... 2
(total 200 numbers in an array)
then, with the numbers in file B, I would like to cut each string from the start position to the number of characters in File B.
E.g. as File B starts with 3, 6, 2 ...
File A will be
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
like this.
So. this is my code so far...
use strict;
if (#ARGV != 2) {
print "Invalid usage\n";
print "Usahe: perl program.pl [num_list] [string_file]\n";
exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
my $i;
open(LIST,$number);
open(DATA,$strings);
my #list = <LIST>;
my $list_size = scalar #sp_list;
for ($i=0;$i<=$list_size;$i++) {
print $i,"\n";
#while (my $line = <DATA>) {
}
close(LIST);
close(DATA);
As the strings and numbers are 200 I changed the array into a scalar value to work on every numbers of every strings.
I'm working on this. and I know I suppose to use a pos function but i do not know how to match each number with each string. is reading the string first by while? or using for to know how many time that I have to run this to achieve the result?
Your help will be much appreciated!
Thank you.
I will be working on it, too. Need your feedback.

It is good that you use strict, and you should also use warnings. Further things to note:
You should check the return value of open to make sure they did not fail. You should also use the three argument form of open and use a lexical file handle. Especially when handling command line arguments, which does pose a security risk.
open my $listfh, "<", $file or die $!;
You may wish to use a safety precaution
use ARGV::readonly;
You can easily make the list of numbers with a map statement. Assuming the numbers are in a comma separated list:
my #list = map split(/\s*,\s*/), <$listfh>;
This will split the input line(s) on comma and strip excess whitespace.
When reading your input file, you do not need to use a counter variable. You can simply do
open my $inputfh, "<", $file or die $!;
while (<$inputfh>) {
my $length = shift #list; # these are your numbers
chomp; # remove newline
my $string = substr($_, 0, -$length); # negative length on substr
print "$string\n";
}
The negative length on substr makes it leave that many characters off the end of the string.
Here is a one-liner in action that demonstrates these principles:
perl -lwe '$f = pop; # save file name for later
#nums = map split(/\s*,\s*/), <>; # process first file
push #ARGV, $f; # put back file name
while (<>) {
my $len = shift #nums;
chomp;
print substr($_,0,-$len);
}' fileb.txt filea.txt
Output:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
Note the use of implicit open of file name arguments by manipulating #ARGV. Also handling newlines with -l switch.

Here is my suggestion. It does use autodie so that there is no need to explicitly check the status of open calls, and temporarily undefines $/ - the input record separator - so that all of the num_list file is read in one go. You aren't clear whether this file will always contain just single line, in which case you can omit local $/.
The numbers are extracted from the text using a regular expression /\d+/g returns all the strings of digits in the input as a list.
The second parameter to substr is the start position of the substring you want, and using a negative number counts from the end of the string instead of the beginning. The third parameter is the number of characters in the substring, and the fourth is a string to replace that substring in the target variable. So substr $data, -$n, $n, '' replaces the substring of length $n starting $n characters from the end with an empty string - i.e. it deletes it.
Note that if it is your intention to remove the given number of characters from the beginning of the string, then you would write substr $data, 0, $n, '' instead.
use strict;
use warnings;
use autodie;
unless (#ARGV == 2) {
print "Usage: perl program.pl [num_list] [string_file]\n";
exit;
}
my #numbers;
{
open my $listfh, '<', $ARGV[0];
local $/;
my $numbers = <$listfh>;
#numbers = $numbers =~ /\d+/g;
};
open my $datafh, '<', $ARGV[1];
for my $i (0 .. $#numbers) {
print "$i\n";
my $n = $numbers[$i];
my $data = <$datafh>;
chomp $data;
substr $data, -$n, $n, '';
print "$data\n";
}

Here is how I would do it. substr is the function to remove a part of a string. From your example, it is not clear whether you want to remove the characters at the beginning or at the end. Both alternatives are shown here:
#!/usr/bin/perl
use warnings;
use strict;
if (#ARGV != 2) {
die "Invalid usage\n"
. "Usage: perl program.pl [num_list] [string_file]\n";
}
my ($number_f, $string_f) = #ARGV;
open my $LIST, '<', $number_f or die "Cannot open $number_f: $!";
my #numbers = split /, */, <$LIST>;
close $LIST;
open my $DATA, '<', $string_f or die "Cannot open $string_f: $!";
while (my $string = <$DATA>) {
substr $string, 0, shift #numbers, q(); # Replace the first n characters with an empty string.
# To remove the trailing portion, replace the previous line with the following:
# my $n = shift #numbers;
# substr $string, -$n-1, $n, q();
print $string;
}
You were not checking the return value of open. Try to remember to always do that.
Do not declare variables far before you are going to use them ($i here).
Do not use C-style for loops if you do not have to. They are prone to fence post errors.

You can use substr():
use strict;
use warnings;
if (#ARGV != 2) {
print "Invalid usage\n";
print "Usage: perl program.pl [num_list] [string_file]\n";
exit(0);
}
my $numbers=$ARGV[0];
my $strings=$ARGV[1];
open my $list, '<', $numbers or die "Can't open $numbers: $!";
open my $data, '<', $strings or die "Can't open $strings: $!";
chomp(my $numlist = <$list>);
my #numbers = split /\s*,\s*/,$numlist;
for my $chop_length (#numbers)
{
my $data = <$data> // die "not enough data in $strings";
print substr($data,0,length($data)-$chop_length)."\n";
}

Your specs say you want "... to cut each string from the start position to the number of characters in File B." I agree with choroba that it's not perfectly clear whether characters from the start or the end of the string are to be cut. However, I tend to think that you want to remove characters from the beginning when you say, "... from the start position ...", but a string like ABCDEFGHIJKLMNOPQRSTUVWXYZ012345 would help clarify this issue.
This option is not as well self-documenting as the other solutions, but a discussion of it will follow:
use strict;
use warnings;
#ARGV == 2 or die "Usage: perl program.pl [num_list] [string_file]\n";
open my $fh, '<', pop or die "Cannot open string file: $!";
chomp( my #str = <$fh> );
local $/ = ', ';
while (<>) {
chomp;
print +( substr $str[ $. - 1 ], $_ ) . "\n";
}
Strings:
ABCDEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Numbers:
3, 6, 2, 5, 6
Output:
DEFGHIJKLMNOPQRSTUVWXYZ012345
BBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEEEEEEEEEEEE
The strings' file name is poped off #ARGV (since an explicit argument for pop is not used) and passed to open to read the strings into #str. The record separator is set to ', ' so chomp leaves only the number. The current line number in $. is used as part of the index to the corresponding #str element, and the remaining characters in the string from n on are printed.