perl: input multiple files as array - arrays

I have five files in different directories. I am extracting data from all of them and writing it to a new file.
Note: I read each file into an array and extract the data with a for loop, once for each file. I want a single for loop that takes the files and processes the rest.
For file1 I am using:
foreach (@file)
{
    my @temp = split(/\t/, trim($_));
    push(@output, $temp[0] . "\t" . $temp[1] . "\n");
}
foreach (uniq(@output))
{
    print $OUTPUTFILE $_;
}
I do this five times to process the five files. Can anyone help me simplify it?

Just wrap it in an outer loop, iterating over all five files:
for my $file ( @five_files ) {
    open my $fh, '<', $file or die "Unable to open $file: $!";
    my @file = <$fh>;
    close $fh;
    foreach (@file) {
        my @temp = split(/\t/, trim($_));
        push(@output, $temp[0] . "\t" . $temp[1] . "\n");
    }
}
# Print once, after every file has been read, so no line is written twice
foreach (uniq(@output)) {
    print $OUTPUTFILE $_;
}
Since you're interested in just the first two fields, the body of the foreach @file loop can be simplified with a list slice:
my @temp = (split /\t/, trim($_))[0, 1];
push @output, join("\t", @temp) . "\n";
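Putting the pieces together, a minimal self-contained sketch might look like the following. It assumes the file names arrive on the command line, and it swaps in a %seen hash for uniq and a substitution for trim, because the question does not show where those two functions come from. The name first_two_columns is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the first two tab-separated fields of every
# line in the given files, skipping duplicates, in first-seen order.
sub first_two_columns {
    my @paths = @_;
    my (%seen, @output);
    for my $path (@paths) {
        open my $fh, '<', $path or die "Unable to open $path: $!";
        while (my $line = <$fh>) {
            $line =~ s/^\s+|\s+$//g;                 # stand-in for trim()
            my @temp = (split /\t/, $line)[0, 1];    # keep only the first two fields
            next unless defined $temp[1];            # skip lines with fewer than two fields
            my $record = join("\t", @temp) . "\n";
            push @output, $record unless $seen{$record}++;
        }
        close $fh;
    }
    return @output;
}

# Usage (assumed): perl script.pl file1 file2 file3 file4 file5 > combined.txt
print first_two_columns(@ARGV);
```

Reading line by line with while avoids holding a whole file in memory, and the hash removes duplicates across all files in a single pass.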

What if you simplify things by flattening out your @file array with join?
Then you can just split it up and deal with the list.
E.g.:
#!/usr/bin/perl
my @file = ("file1\tfile3 ","file1\tfile3\tfile3 ","file2"); # Some test data.
my $in = join "\t", @file; # Make one string.
my @temp = split(" ", $in); # Split it on whitespace.
# Did it work?
foreach (@temp)
{
    print "($_)\n"; # use () to see if we have any white spaces.
}
Might be a problem if you have spaces in your filenames though!

Related

Creating array of file names using grep

I'm having difficulty outputting file names as an array using grep. Specifically, I want to create an array of file names (plant photos) formatted like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Here's my original Perl script that I'm attempting to modify. It returns, as intended, ONE file name per plant as "Genus_species", printing a list of those plants:
#!/usr/bin/perl
use strict;
use warnings;
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $!";
my @files =
    map { s/1\.png\z/.png/r } # Removes "1" from end of file names
    grep { /^[^2-9]*\.png\z/i && /_/ } # Finds "Genus_species.png" & "Genus_species1.png" and returns one file name per plant as "Genus_species.png"
    readdir $dfh;
foreach my $file (@files) {
    $file =~ s/\.png//; # Removes ".png" extension
    print "$file\n"; # Prints list of file names (plant names)
}
Here's the output:
Ilex_verticillata
Asarum_canadense
Ageratina_altissima
However, since each plant often has MULTIPLE photos (e.g., "Genus_species1.png", "Genus_species2.png", etc.), I need to re-grep the directory using the above output to find their file names, then output the results in the form of an array as previously illustrated.
I know the solution likely involves modifying the "foreach" statement, using grep to return ALL file names with "Genus_species" in their name. Here's what I tried:
foreach my $file (@files) {
    $file =~ s/\.png//;
    grep ($file,readdir(DIR));
    print "$file = $file\n";
}
But the output was this:
Ilex_verticillata = Ilex_verticillata
Asarum_canadense = Asarum_canadense
Ageratina_altissima = Ageratina_altissima
Again, I want to output an array formatted as:
"Genus_species= Genus_species1.png, Genus_species2.png, etc.," meaning I want it to look like this:
Ilex_verticillata= Ilex_verticillata1.png, Ilex_verticillata2.png, Ilex_verticillata3.png
Asarum_canadense= Asarum_canadense1.png, Asarum_canadense2.png
Ageratina_altissi= Ageratina_altissi1.png, Ageratina_altissi2.png
Notice that I also want to add back the ".png" extension ONLY to the file names to the right of the equals sign.
Please advise. Thanks.
readdir returns a list of the entries in the folder. You've processed them in a single statement, which is compact. However, if you loop over them, you can process the items further.
#!/usr/bin/perl
use strict;
use warnings;
use English; ## use names rather than symbols for special variables
my $dir = '/Users/jdm/Desktop/xampp/htdocs/cnc/images/plants';
opendir my $dfh, $dir or die "Can't open $dir: $OS_ERROR";
my %genus_species; ## store matching entries in a hash
for my $file (readdir $dfh)
{
    next unless $file =~ /\d\.png$/; ## skip entry if not a png file ending with a number
    my $genus = $file =~ s/\d\.png$//r;
    push(@{$genus_species{$genus}}, $file); ## push onto the array; @{} dereferences the hash value as an array reference
}
for my $genus (keys %genus_species)
{
    print "$genus = ";
    print "$_ " for sort @{$genus_species{$genus}}; # sort and loop through the entries in the array reference
    print "\n";
}
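To get the comma-separated layout requested in the question, the same hash-of-arrays idea can be combined with join. The sketch below works on a plain list of names so that it runs without the image directory; format_plant_photos is a hypothetical name, and the file names are the examples from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: group numbered PNG names by plant and format each
# group as "Genus_species= file1.png, file2.png".
sub format_plant_photos {
    my @names = @_;
    my %photos_for;
    for my $name (@names) {
        # Capture everything before the trailing number; skip other entries.
        next unless $name =~ /^(.+?)\d+\.png\z/i;
        push @{ $photos_for{$1} }, $name;
    }
    my @lines;
    for my $plant (sort keys %photos_for) {
        push @lines, "$plant= " . join(', ', sort @{ $photos_for{$plant} });
    }
    return @lines;
}

# Example names from the question; in practice they would come from readdir.
print "$_\n" for format_plant_photos(qw(
    Ilex_verticillata2.png Ilex_verticillata1.png Ilex_verticillata3.png
    Asarum_canadense1.png  Asarum_canadense2.png
));
```

Note that sort compares strings, so a photo numbered 10 would be listed before one numbered 2.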

Perl - push lines inbetween regex into one element of array

This is the log file I am dealing with -
|
blah1a
blah1b
blah1c
|
****blahnothing1
|
blah2a
blah2b
blah2c
|
blahnothing2
|
blah3a
blah3b
blah3c
|
blahnothing3
The information that I need is nestled between two pipe characters. There are a lot of lines that start with asterisks; I skip over them. Each line has Windows end-of-line characters. The data in between these pipe characters is contiguous, but when read on a Linux host, it is chopped up by the Windows newlines.
I wrote the Perl script with a range operator between the two lines, hoping that everything that starts with a pipe delimiter would get pushed into an array element and then stop at the next pipe delimiter, then start again. Each array element would have all the lines in between the two pipe characters.
Ideally the array would look like this, sans the Windows control characters.
$lines[0] blah1a blah1b blah1c
$lines[1] blah2a blah2b blah2c
$lines[2] blah3a blah3b blah3c
However, the array elements do not look like that.
#!/usr/bin/perl
use strict ;
use warnings ;
my $delimiter = "|";
my $filename = $ARGV[0] ;
my @lines ;
open(my $fh, '<:encoding(UTF-8)' , $filename) or die "could not open file $filename $!";
while (my $line = readline $fh) {
    next if ($line =~/^\*+/) ;
    if ($line =~ /$delimiter/ ... $line =~/$delimiter/) {
        push (@lines, $line) ;
    }
}
print $lines[0] ;
print $lines[1] ;
print $lines[2] ;
This seems to satisfy your requirement.
I've left the two lines blahnothing2 and blahnothing3 in place, as I couldn't see a rationale for removing them.
The \R regex pattern is the generic newline, and matches the newline sequences from any platform, i.e. CR, LF, or CRLF.
use strict;
use warnings 'all';
my $data = do {
    open my $fh, '<:raw', 'blah.txt' or die $!;
    local $/;
    <$fh>;
};
$data =~ s/^\s*\*.*\R/ /gm; # Remove lines starting with *
$data =~ s/\R/ /g; # Change all line endings to spaces
# Split on pipe and remove blank elements
my @data = grep /\S/, split /\s*\|\s*/, $data;
use Data::Dump;
dd \@data;
output
[
"blah1a blah1b blah1c",
"blah2a blah2b blah2c",
"blahnothing2",
"blah3a blah3b blah3c",
"blahnothing3 ",
]
It seems that you want to merge the lines between | characters into a single string, which is then placed in an array.
One way is to set | as the input record separator, so that each read returns one chunk between pipes:
{ # localize the change to $/
    local $/ = "|";
    open(my $fh, '<:encoding(UTF-8)' , $filename)
        or die "could not open file $filename $!";
    my @records;
    while (my $section = <$fh>)
    {
        next if $section =~ /^\s*\*/;
        chomp $section; # remove the record separator (| here)
        $section =~ s/\R/ /g; # clean up newlines
        $section =~ s/^\s*//; # clean up leading spaces
        push @records, $section if $section;
    }
    print "$_\n" for @records;
}
I skip a "section" if it starts with * (optionally preceded by whitespace). There can be more restrictive versions. The $section can end up being an empty string, so we push it onto the array conditionally.
Output, with the example in the question copy-pasted into the input file named by $filename:
blah1a blah1b blah1c
blah2a blah2b blah2c
blahnothing2
blah3a blah3b blah3c
blahnothing3
The approach in the question is fine, but you need to merge the lines within a "section" (between pipes) and place each such string on the array. So you need a flag to track when you enter or leave a section.
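A minimal sketch of that flag-based approach follows. It assumes that pipe lines alternate between opening and closing a section, so text between a closing pipe and the next opening pipe (such as blahnothing2) is dropped, which matches the desired output in the question. The name sections_between_pipes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines between each opening and closing
# pipe line into one space-joined string per section.
sub sections_between_pipes {
    my @lines = @_;
    my (@records, @current);
    my $in_section = 0;                      # flag: are we between two pipes?
    for my $line (@lines) {
        $line =~ s/\R\z//;                   # strip an LF or CRLF line ending
        next if $line =~ /^\s*\*/;           # skip lines starting with *
        if ($line =~ /^\s*\|\s*\z/) {        # a line holding only a pipe
            # A pipe seen while inside a section closes it.
            push @records, join(' ', @current) if $in_section && @current;
            @current    = ();
            $in_section = !$in_section;      # toggle on every pipe line
            next;
        }
        push @current, $line if $in_section;
    }
    return @records;
}

# Small sample in the shape of the log from the question, with CRLF endings.
my @log = ("|\r\n", "blah1a\r\n", "blah1b\r\n", "blah1c\r\n", "|\r\n",
           "****blahnothing1\r\n", "|\r\n", "blah2a\r\n", "|\r\n",
           "blahnothing2\r\n");
print "$_\n" for sections_between_pipes(@log);
```

For a real file, the lines would come from a filehandle, for example sections_between_pipes(<$fh>).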

Perl: Want to add and then average columns of tab delimited data

The data is a table that includes names in the first row and first column, so I keep getting a non-numeric value error. I figured out how to ignore the first row by using if ($row[0] ne "retrovirus" ), but I don't know how to ignore the first column. I am new to programming and having a really hard time understanding arrays and how to get them to work. How do I split my data into columns of numbers, excluding the words, and add them together?
This is what I have so far, and it's giving incorrect answers.
#!/usr/bin/perl -w
use strict;
# Part A. Computing the average bp length of the virus's
# genomes and each individual gene in the text file.
my $infile = "lab1_table.txt";
open INFILE, $infile or die "$infile: $!";
my @totals = ();
while (my $line = <INFILE>){
    chomp $line;
    my $total = 0;
    my $n = 0;
    # Splitting into columns
    my @row = split /\t/, $line;
    # Working through and adding up each column
    foreach my $element (@row) {
        # Ignoring first line with headings
        if ($row[0] ne "retrovirus" ){
            $total = $total + $element;
            print "$total \n";
        }
    }
}
close INFILE;
If you don't care about the first element of the row at all, just use shift(@row) before the foreach loop. Or, if you want to preserve the original values, you can take the elements from the second to the last:
#!/usr/bin/perl -w
use strict;
# Part A. Computing the average bp length of the virus's
# genomes and each individual gene in the text file.
my $infile = "lab1_table.txt";
open INFILE, $infile or die "$infile: $!";
while (my $line = <INFILE>)
{
    chomp $line;
    my $total = 0;
    # Splitting into columns
    my @row = split /\t/, $line;
    # Adding up every element of the row except the first (the name)
    if ($row[0] ne "retrovirus" )
    {
        $total += $_ for @row[1 .. $#row];
        print "$total \n";
    }
}
close INFILE;
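The code above totals each row. Since the question title asks for column averages, here is a hedged sketch of that variant. It assumes a rectangular table whose first row and first column hold labels; column_averages is a hypothetical name, and the sample values are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given tab-separated lines whose first row and first
# column hold labels, return the average of each numeric column.
sub column_averages {
    my @lines = @_;
    shift @lines;                              # drop the header row
    my @sum;
    my $rows = 0;
    for my $line (@lines) {
        chomp $line;
        my @row = split /\t/, $line;
        shift @row;                            # drop the row label
        next unless @row;                      # skip blank lines
        $sum[$_] += $row[$_] for 0 .. $#row;   # add each value to its column total
        $rows++;
    }
    return map { $_ / $rows } @sum;            # @sum is empty when no rows were read
}

# Made-up sample table; real input would be read from lab1_table.txt.
my @table = (
    "retrovirus\tgenome\tgag\n",
    "virusA\t10\t4\n",
    "virusB\t20\t6\n",
);
print join("\t", column_averages(@table)), "\n";
```

Keeping one running total per column index means the file only has to be read once.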

Empty array in a perl while loop, should have input

Was working on this script when I came across a weird anomaly. When I go to print @extract after declaring it, it prints correctly the following:
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Now the weird part: when I then try to print or return @extract (or $column) inside of the while loop, it comes up empty, thus rendering the rest of the script useless. I've never come across this before, and I haven't been able to find any documentation or people with similar problems. Below is the code; I marked with #<------ where the problems are and are not, to see if anyone has any idea what is going on. Thank you kindly.
P.S. I am utilizing perl version 5.12.2
use strict;
use warnings;
#use diagnostics;
#use feature qw(say);
open (S, "Val nuc align.txt") || die "cannot open FASTA file to read: $!";
open (OUTPUT, ">output.txt");
my @extract;
my $sum = 0;
my @lines = <S>;
my @seq = ();
my $start = 0; #amino acid column start
my $end = 10; #amino acid column end
#Removing of the sequence tag until amino acid sequence composition (from >gi to )).
foreach my $line (@lines) {
    $line =~ s/\n//g;
    if ($line =~ />/g) {
        $line =~ s/>.*\]/>/g;
        push @seq, $line;
    }
    else {
        push @seq, $line;
    }
}
my $seq = join ('', @seq);
my @seq_prot = join "\n", split '>', $seq;
@seq_prot = grep {/[A-Z]/} @seq_prot;
#number of sequences
print OUTPUT "Number of sequences:", scalar (grep {defined} @seq_prot), "\n";
#selection of amino acid sequence. From $start to $end.
my @vertical_array;
while ( my $line = <@seq_prot> ) {
    chomp $line;
    my @split_line = split //, $line;
    for my $index ( $start..$end ) { #AA position, extracts whole columns
        $vertical_array[$index] .= $split_line[$index];
    }
}
# Print out your vertical lines
for my $line ( @vertical_array ) {
    my $extract = say OUTPUT for unpack "(a200)*", $line; #split at end of each column
    @extract = grep {defined} $extract;
}
print OUTPUT @extract; #<--------------- This prints correctly the input
#Count selected amino acids excluding '-'.
my %counter;
while (my $column = @extract) {
    print @extract; #<------------------------ Empty print, no input found
}
Update: I found the main problem to be the unpack command. I thought I could use it to split the columns of my input every X elements (43 in this case). While this works, the minute I change $start to a number other than 0 (say 200), the code raises errors. It probably has something to do with the number of column elements not matching the lines. Will keep this updated.
Write your last while loop the same way as your previous for loop. The assignment
my $column = @extract
is in scalar context, which does not give you the same result as:
for my $column (@extract)
Instead, it will give you the number of elements in the array. Try this second option and it should work.
However, I still have a concern: if @extract had anything in it, you would get an infinite loop. Is there any code that you did not include between your two commented lines?
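The difference between the two contexts can be seen in a small standalone sketch. The three strings are placeholders rather than the real alignment columns.

```perl
use strict;
use warnings;

my @extract = ('MMM', 'SSS', 'TII');   # placeholder values

# Scalar context: the array evaluates to its element count.
my $count = @extract;
print "count: $count\n";               # prints "count: 3"

# List context: the loop variable takes each element in turn.
my @seen;
for my $column (@extract) {
    push @seen, $column;
}
print "columns: @seen\n";              # prints "columns: MMM SSS TII"

# A while loop over an array ends only if the array shrinks,
# for example by shifting one element off per iteration.
my @queue = @extract;
my @shifted;
while (@queue) {
    my $column = shift @queue;
    push @shifted, $column;
}
print "shifted: @shifted\n";           # prints "shifted: MMM SSS TII"
```

In the question's loop nothing removes elements from the array, so the condition stays true forever whenever the array is non-empty and is false from the start when it is empty.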

Checking for Duplicates in array

What's going on:
I've ssh'd onto my localhost, ls the desktop and taken those items and put them into an array.
I hardcoded a short list of items and I am comparing them with a hash to see if anything is missing from the host (See if something from a is NOT in b, and let me know).
So after figuring that out, when I print out the "missing files" I get a bunch of duplicates (see below), not sure if that has to do with how the files are being checked in the loop, but I figured the best thing to do would be to just sort out the data and eliminate dupes.
When I do that, and print out the fixed data, only one file is printing, two are missing.
Any idea why?
#!/usr/bin/perl
my $hostname = $ARGV[0];
my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output =`ssh $hostname "cd Desktop; ls -a"`;
my %comparison;
for my $file (@hostFiles) {
    $comparison{$file} +=1;
}
for my $file (@output) {
    $comparison{$file} +=2
}
for my $file (sort keys %comparison) {
    @missing = "$file\n" if $comparison{$file} ==1;
    #print "Extra file: $file\n" if $comparison{$file} ==2;
    print @missing;
}
my @checkedMissingFiles;
foreach my $var ( @missing ){
    if ( ! grep( /$var/, @checkedMissingFiles) ){
        push( @checkedMissingFiles, $var );
    }
}
print "\n\nThe missing Files without dups:\n @checkedMissingFiles\n";
Password:
awesomeness.txt ##This is what is printing after comparing the two arrays
awesomeness.txt
filecheck.pl
filecheck.pl
filecheck.pl
hostscript.pl
hostscript.pl
The missing Files without dups: ##what prints after weeding out duplicates
hostscript.pl
The Perl way of doing this would be:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my %hostFiles = qw( filecheck.pl 1 hostscript.pl 1 awesomeness.txt 1);
# ssh + backticks + ls, not the greatest way to do this, but that's another Q
my @files =`ssh $ARGV[0] "ls -a ~/Desktop"`;
# get rid of the newlines
chomp @files;
# grep returns the matching elements of @files
my %existing = map { $_ => 1} grep {exists($hostFiles{$_})} @files;
print Dumper([grep { !exists($existing{$_})} keys %hostFiles]);
Data::Dumper is a utility module, I use it for debugging or demonstrative purposes.
If you want to print the list, you can do something like this:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    print grep { !exists($existing{$_})} keys %hostFiles;
}
$ORS is the output record separator (it's printed after any print) and $OFS is the output field separator which is printed between the print arguments. See perlvar. You can get away with not using "English", but the variable names will look uglier. The block and the local are so you don't have to save and restore the values of the special variables.
If you want to write the result to a file, something like this would do:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    open F, ">host_$ARGV[0].log" or die "Cannot write host_$ARGV[0].log: $!";
    print F grep { !exists($existing{$_})} keys %hostFiles;
    close F;
}
Of course, you can also do it the "classical" way: loop through the array and print each element:
open F, ">host_$ARGV[0].log" or die "Cannot write host_$ARGV[0].log: $!";
for my $missing_file (grep { !exists($existing{$_})} keys %hostFiles) {
    use English;
    local $ORS = "\n";
    print F "File is missing: $missing_file"
}
close F;
This allows you to do more things with the file name; for example, you can SCP it over to the host.
It seems to me that looping over the 'required' list makes more sense - looping over the list of existing files isn't necessary unless you're looking for files that exist but aren't needed.
#!/usr/bin/perl
use strict;
use warnings;
my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output =`ssh $ARGV[0] "cd Desktop; ls -a"`;
chomp @output;
my @missingFiles;
foreach (@hostFiles) {
    push( @missingFiles, $_ ) unless $_ ~~ @output;
}
print join("\n", "Missing files: ", @missingFiles);
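The smartmatch operator ~~ has been flagged as experimental since Perl 5.18, so it may emit warnings on newer interpreters. A hedged alternative is a hash lookup, sketched below. The name missing_files is hypothetical, and the listing is a stand-in for the ssh output so the example runs without a remote host.

```perl
use strict;
use warnings;

# Hypothetical helper: return the required names that are absent from the
# listing, preserving the order of the required list.
sub missing_files {
    my ($required, $listing) = @_;            # two array references
    my %present = map { $_ => 1 } @$listing;  # name => 1 for every listed file
    return grep { !$present{$_} } @$required;
}

my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output    = ("filecheck.pl\n", ".\n", "..\n");   # stand-in for the ssh/ls output
chomp @output;                                       # names must match exactly

print "Missing files:\n";
print "$_\n" for missing_files(\@hostFiles, \@output);
```

Building the hash once makes each membership test a single lookup instead of a scan of the whole listing.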
@missing = "$file\n" assigns the array @missing to contain a single element, "$file\n". It does this every loop, leaving it with the last missing file.
What you want is push(@missing, "$file\n").
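A small sketch of the push-based loop, followed by a hash-based way to drop duplicates, might look like this. The counts in %comparison are made up to mirror the question's scheme (1 means expected only, 3 means expected and present).

```perl
use strict;
use warnings;

# Made-up counts following the question's scheme.
my %comparison = (
    'filecheck.pl'    => 1,   # expected only
    'hostscript.pl'   => 1,   # expected only
    'awesomeness.txt' => 3,   # expected and present (1 + 2)
);

my @missing;
for my $file (sort keys %comparison) {
    push @missing, $file if $comparison{$file} == 1;   # append, don't overwrite
}

# Remove duplicates with a hash instead of a regex match, so that
# characters such as "." in a file name are compared literally.
my %seen;
my @unique = grep { !$seen{$_}++ } @missing;

print "Missing: $_\n" for @unique;
```

Printing after the loop, rather than inside it, also avoids repeating earlier entries on every iteration.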
