Perl text file grep - arrays

I would like to create an array in Perl of strings that I need to search/grep from a tab-delimited text file. For example, I create the array:
#!/usr/bin/perl -w
use strict;
use warnings;
# array of search terms
my @searchArray = ('10060\t', '10841\t', '11164\t');
I want to have a foreach loop to grep a text file with a format like this:
c18 10706 463029 K
c2 10841 91075 G
c36 11164 . B
c19 11257 41553 C
for each of the elements of the above array. In the end, I want to have a NEW text file that would look like this (continuing this example):
c2 10841 91075 G
c36 11164 . B
How do I go about doing this? Also, this needs to be able to work on a text file with ~5 million lines, so memory cannot be wasted (I do have 32GB of memory though).
Thanks for any help/advice in advance! Cheers.

Using a perl one-liner. Just translate your list of numbers into a regex.
perl -ne 'print if /\b(?:10060|10841|11164)\b/' file.txt > newfile.txt

You can search for alternatives by using a regexp like /(10060\t|10841\t|11164\t)/. Since your array could be large, you could build this regexp with something like
my $searchRegex = '(' . join('|', @searchArray) . ')';
this is just a simple string, and so it would be better (faster) to compile it to a regexp by
$searchRegex = qr/$searchRegex/;
With only 5 million lines, you could actually pull the entire file into memory (less than a gigabyte if 100 chars/line), but otherwise, line by line you could search with this pattern as in
while (<>) {
    print if $_ =~ $searchRegex;
}
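Putting those pieces together, a minimal sketch of the whole approach (the sample rows are inlined here for illustration; in practice you would read from your input file and print each match to the new file):

```perl
use strict;
use warnings;

# Build one compiled alternation from the search terms; quotemeta guards
# against regex metacharacters in the terms.
my @searchArray = ('10060', '10841', '11164');
my $pattern     = join '|', map { quotemeta } @searchArray;
my $searchRegex = qr/(?:$pattern)\t/;

# Stand-ins for lines read from the tab-delimited file.
my @lines = (
    "c18\t10706\t463029\tK\n",
    "c2\t10841\t91075\tG\n",
    "c36\t11164\t.\tB\n",
    "c19\t11257\t41553\tC\n",
);

# With a real file you would loop: while (<$in>) { print {$out} $_ if /$searchRegex/ }
my @kept = grep { /$searchRegex/ } @lines;
print @kept;
```

Matching line by line like this keeps memory use flat no matter how many lines the file has.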

So I'm not the best coder but this should work.
#!/usr/bin/perl -w
use strict;
use warnings;
# array of search terms
my $searchfile = 'file.txt';
my $outfile = 'outfile.txt';
my @searchArray = ('10060', '10841', '11164');
my @findArray;
open(READ, '<', $searchfile) || die $!;
while (<READ>)
{
    foreach my $term (@searchArray) {
        if (/$term/) {
            chomp($_);
            push(@findArray, $_);
        }
    }
}
close(READ);
### For Console Print
#foreach (@findArray){
#    print $_."\n";
#}
open(WRITE, '>', $outfile) || die $!;
foreach (@findArray){
    print WRITE $_."\n";
}
close(WRITE);

Related

How to read a .txt file and store it into an array

I know this is a fairly simple question, but I cannot figure out how to store all of the values in my array the way I want to.
Here is a small portion what the .txt file looks like:
0 A R N D
A 2 -2 0 0
R -2 6 0 -1
N 0 0 2 2
D 0 -1 2 4
Each value is delimited by either two spaces (if the next value is positive) or a space and a '-' (if the next value is negative).
Here is the code:
use strict;
use warnings;
open my $infile, '<', 'PAM250.txt' or die $!;
my $line;
my #array;
while($line = <$infile>)
{
$line =~ /^$/ and die "Blank line detected at $.\n";
$line =~ /^#/ and next; #skips the commented lines at the beginning
@array = $line;
print "@array"; #Prints the array after each line is read
};
print "\n\n@array"; #only prints the last line of the array ?
I understand that @array only holds the last line that was passed to it. Is there a way I can get @array to hold all of the lines?
You are looking for push.
push @array, $line;
You undoubtedly want to precede this with chomp to snip any newlines, first.
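A self-contained sketch of that chomp-then-push loop (an in-memory filehandle stands in for PAM250.txt here; opening the real file with the three-argument open works the same way):

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for the real file;
# open my $infile, '<', 'PAM250.txt' would work identically.
my $data = "0  A  R  N  D\nA  2 -2  0  0\nR -2  6  0 -1\n";
open my $infile, '<', \$data or die $!;

my @array;
while (my $line = <$infile>) {
    chomp $line;          # snip the trailing newline first
    push @array, $line;   # append instead of overwriting the array
}
close $infile;

print scalar(@array), " lines stored\n";   # prints "3 lines stored"
```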
If the file is small compared to the available memory on your machine, then you can simply use the method below to read the contents of the file into an array:
open my $infile, '<', 'PAM250.txt' or die $!;
my #array = <$infile>;
close $infile;
If you are going to read a very large file then it is better to read it line by line as you are doing, but use push to add each line at the end of the array.
push(@array, $line);
I would also suggest reading about some more array-manipulating functions in Perl.
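For reference, the most common of those array-manipulating functions in action, as a small standalone sketch:

```perl
use strict;
use warnings;

my @array = ('b', 'c');
push    @array, 'd';        # add to the end:      b c d
unshift @array, 'a';        # add to the front:    a b c d
my $last  = pop   @array;   # remove from the end ($last is 'd')
my $first = shift @array;   # remove from the front ($first is 'a')
print "@array\n";           # prints "b c"
```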
It's unclear what you want to achieve.
Is every line an element of your array?
Is every line an array in your array and your "words" are the elements of this array?
Anyhow.
Here is how you can achieve both:
use strict;
use warnings;
use Data::Dumper;
# Read all lines into your array, after removing the \n
my @array = map { chomp; $_ } <>;
# show it
print Dumper \@array;
# Make each line an array so that you have an array of arrays
$_ = [ split ] foreach @array;
# show it
print Dumper \@array;
try this...
sub room
{
    my $result = "";
    open(my $fh, '<', $_[0]) or die $!;
    while (<$fh>) { $result .= $_; }
    close($fh);
    return $result;
}
This gives you the basic functionality without many words. The suggestion above risks failing on large files. Call it as you like...
my $content = room('/etc/passwd');
print room('/etc/passwd');
You can shorten or rename it as you find convenient.
To the critics nearby: this way the push is replaced by simplicity. A text file contains line breaks; the traditional chomp-and-push approach strips each line break and pushes just the line, whereas this construction keeps the whole file as a single string with its line breaks intact.

Assigning range to an array in Perl

I have a small problem. How can I assign a range to an array, like this one:
input file: clktest.spf
*
.GLOBAL vcc! vss!
*
.SUBCKT eclk_l_25h brg2eclk<1> brg2eclk<0> brg_cs_sel brg_out brg_stop cdivx<1>
+ eclkout1<24> eclkout1<23> eclkout1<22> eclkout1<21> eclkout1<20> eclkout1<19>
+ mc1_brg_dyn mc1_brg_outen mc1_brg_stop mc1_div2<1> mc1_div2<0> mc1_div3p5<1>
+ mc1_div3p5<0> mc1_div_mux<3> mc1_div_mux<2> mc1_div_mux<1> mc1_div_mux<0>
+ mc1_gsrn_dis<0> pclkt6_0 pclkt6_1 pclkt7_0 pclkt7_1 slip<1> slip<0>
+ ulc_pclkgpll0<1> ulc_pclkgpll0<0> ulq_eclkcib<1> ulq_eclkcib<0>
*
*Net Section
*
*|GROUND_NET 0
*
*|NET eclkout3<48> 2.79056e-16
*|P (eclkout3<48> X 0 54.8100 -985.6950)
*|I (RXR0<16>#NEG RXR0<16> NEG X 0 54.2255 -985.6950)
C1 RXR0<16>#NEG 0 5.03477e-17
C2 eclkout3<48> 0 2.28708e-16
Rk_6_1 eclkout3<48> RXR0<16>#NEG 0.110947
output (this should be the saved value in the array)
.SUBCKT eclk_l_25h brg2eclk<1> brg2eclk<0> brg_cs_sel brg_out brg_stop cdivx<1>
+ eclkout1<24> eclkout1<23> eclkout1<22> eclkout1<21> eclkout1<20> eclkout1<19>
+ mc1_brg_dyn mc1_brg_outen mc1_brg_stop mc1_div2<1> mc1_div2<0> mc1_div3p5<1>
+ mc1_div3p5<0> mc1_div_mux<3> mc1_div_mux<2> mc1_div_mux<1> mc1_div_mux<0>
+ mc1_gsrn_dis<0> pclkt6_0 pclkt6_1 pclkt7_0 pclkt7_1 slip<1> slip<0>
+ ulc_pclkgpll0<1> ulc_pclkgpll0<0> ulq_eclkcib<1> ulq_eclkcib<0>
*
*Net Section
my simple code:
#!/usr/bin/perl
use strict;
use warnings;
my $input = "clktest.spf";
open INFILE, $input or die "Can't open $input" ;
my @allports;
while (<INFILE>){
    @allports = /\.SUBCKT/ ... /\*Net Section/ ;
    print @allports;
}
Am I doing a correct job of assigning the selected range to an array? If not, how can I modify this code?
Thanks in advance.
The while loop only gives you one line at a time so you can't assign all of the lines you want at once. Use push instead to grow the array line by line.
Also, you should be using lexical file handles like $in_fh (rather than global ones like INFILE) with the three-parameter form of open, and you should include the $! variable in the die string so that you know why the open failed.
This is how your program should look
#!/usr/bin/perl
use strict;
use warnings;
my $input = 'clktest.spf';
open my $in_fh, '<', $input or die "Can't open $input: $!" ;
my @allports;
while ( <$in_fh> ) {
    push @allports, $_ if /\.SUBCKT/ ... /\*Net Section/;
}
print @allports;
Note that, if all you want to do is to print the selected lines from the file, you can forget about the array and replace push @allports, $_ with print
The <INFILE> inside a while reads the file line by line, so it is not the right place to apply a regex that needs to cover more than one line. To get the substring, the simplest way is to first join all the lines, and only then apply your regex.
my $contents = "";
while ( <INFILE> ) {
    $contents = $contents . $_;
}
$contents =~ s/.*(\.SUBCKT.*\*Net Section).*/$1/s; # remove unneeded part
Please note that there is /s modifier in the last part of substitution line. This is required because $contents contains newlines.
To get the substring into an array, just use split: my @allports = split("\n", $contents);
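A short sketch of that slurp, substitute, and split sequence, using an abbreviated inline stand-in for the slurped file contents:

```perl
use strict;
use warnings;

# Abbreviated stand-in for the slurped file contents.
my $contents = "junk before\n.SUBCKT top a b\n+ c d\n*Net Section\njunk after\n";

# /s lets . match newlines, so the capture spans multiple lines.
$contents =~ s/.*(\.SUBCKT.*\*Net Section).*/$1/s;

my @allports = split("\n", $contents);
print scalar(@allports), " lines kept\n";   # prints "3 lines kept"
```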

Perl split array

I'm new to Perl. I want to write a simple program which reads an input file and counts the letters in it. This is my code:
#!/usr/bin/perl
$textfile = "example.txt";
open(FILE, "< $textfile");
@array = split(//, <FILE>);
$counter = 0;
foreach(@array){
    $counter = $counter + 1;
}
print "Letters: $counter";
this code shows me the number of letters, but only for the first paragraph of my input file. It doesn't work for more than one paragraph. Can anyone help me? I don't know the problem =(
thank you
You only ever read one line.
You count bytes (for which you could use -s), not letters.
Fix:
my $count = 0;
while (<>) {
    $count += () = /\pL/g;
}
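The $count += () = /\pL/g; line relies on a list assignment evaluated in scalar context, which yields the number of elements assigned. In isolation, with sample lines standing in for file input:

```perl
use strict;
use warnings;

# A list assignment evaluated in scalar context returns the number of
# elements on its right-hand side, so this counts every letter match.
my $count = 0;
for my $line ("Hello, world!\n", "42 is not a letter\n") {
    $count += () = $line =~ /\pL/g;
}
print "Letters: $count\n";   # prints "Letters: 22"
```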
Your code is a rather over-complicated way of doing this:
#!/usr/bin/perl
# Always use these
use strict;
use warnings;
# Define variables with my
my $textfile = "example.txt";
# Lexical filehandle, three-argument open
# Check return from open, give sensible error
open(my $file, '<', $textfile) or die "Can't open $textfile: $!";
# No need for an array.
my $counter = length <$file>;
print "Letters: $counter";
But, as others have pointed out, you're counting bytes not characters. If your file is in ASCII or an 8-bit encoding, then you should be fine. Otherwise you should look at perluniintro.
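To illustrate that difference, here is a small sketch: decoding the same bytes through a UTF-8 layer changes what length reports (perluniintro and the Encode documentation cover this in depth):

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "nee" with an acute accent: the accented e takes two bytes.
my $bytes = "n\xc3\xa9e";
my $byte_count = length($bytes);              # 4: byte semantics

# Reading through an :encoding layer decodes bytes into characters.
open my $fh, '<:encoding(UTF-8)', \$bytes or die $!;
my $chars = <$fh>;
close $fh;
my $char_count = length($chars);              # 3: character semantics

print "$byte_count bytes, $char_count characters\n";
```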
Here's an alternative approach using a module to do the work...
# the following two lines enforce 'clean' code
use strict;
use warnings;
# load some help (read_file)
use File::Slurp;
# load the file into the variable $text
my $text = read_file('example.txt');
# get rid of multiple whitespace and linefeed chars # ****
# and replace them with a single space # ****
$text =~ s/\s+/ /g; # ****
# length gives you the length of the 'string' / scalar variable
print length($text);
you might want to comment out the lines marked '****'
and play with the code...

Perl - Open large txt file on server and create / save into smaller files of 100 lines each

I am trying to do this:
I FTP a large file of single words (~144,000 and one word per line)
I need to open uploaded file and create files with 100 lines max one
word per line (01.txt, 02.txt etc).
I would like the processed 100 to be REMOVED from the original file
AFTER the file of 100 is created.
The server is shared but, I can install modules if needed.
Now, my code below is very crude as my knowledge is VERY limited. One problem is opening the whole file into an array? I assume the shared server does not have enough memory to open such a large file and read it all into memory at once. I just want the first 100 lines. Below I am just opening a file that is small enough to be loaded and getting 100 lines into an array. Nothing else. I typed it quickly, so it probably has several issues, but it shows my limited knowledge and need for help.
use vars qw($Word @Words $IN);
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
my $cnt= '0';
open $IN, '<', "$PathToFile" or die $!;
while (<$IN>) {
    chomp;
    $Word = $_;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    if ($cnt <= 99){
        push(@Words, $Word);
    }
    $cnt++;
}
close $IN;
Thanks so much.
Okay, I am trying to implement the code below:
#!/usr/bin/perl -w
BEGIN {
my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print CGI::header();
my $WORD_LIST='/home/username/public/wordlists/Big-File-Of-Words.txt';
sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
print 'Done';
1;
But I get:
syntax error at split-up-big-file.pl line 12, near "sed 's/ *//g'"
Can't find string terminator "'" anywhere before EOF at split-up-big-file.pl line 12.
FINALLY:
Well I figured out a quick solution that works. Not pretty:
#!/usr/bin/perl -w
BEGIN {
my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use diagnostics;
print CGI::header();
my $sourcefile = '/home/username/public_html/test/bigfile.txt';
my $rowlimit = 100;
my $cnt= '1';
open(IN, $sourcefile) or die "Failed to open $sourcefile";
my $outrecno = 1;
while(<IN>) {
    if($outrecno == 1) {
        my $filename = $cnt.'.txt';
        open OUT, ">$filename" or die "Failed to create $filename";
        $cnt++;
    }
    print OUT $_;
    if($outrecno++ == $rowlimit) {
        $outrecno = 1;
        close OUT;
    }
}
close OUT;
I found enough info here to get me going. Thanks...
Here is a solution based on a slight modification of your code that should work approximately the way you want it.
It loops through all the lines of the input file and for every 100th line it will write the word list of the words encountered since the last write (or the beginning). The eof($IN) check is to catch the remaining lines if they are less than 100.
use strict;
use warnings;
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
open my $IN, '<', "$PathToFile" or die $!;
my $cnt = 0;
my $cnt_file = 0;
my @Words;
while ( my $Word = <$IN> ) {
    chomp $Word;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    push(@Words, $Word);
    if ( !(++$cnt % 100) || eof($IN) ) {
        $cnt_file++;
        open my $out_100, '>', "file_$cnt_file.txt" or die $!;
        print $out_100 join("\n", @Words), "\n";
        close $out_100;
        @Words = ();
    }
}
There's a non-Perl solution that you might find interesting...
$ split -l 100 -a6 /home/username/public/wordlists/Big-File-Of-Words.txt words.
This will split your big file of words into a bunch of files with no more than 100 lines each. The file name will start with words., and the suffix will range from aaaaaa to zzzzzz. Thus, you'll have words.aaaaaa, words.aaaaab, words.aaaaac, etc. You can then recombine all of these files back into your word list like this:
$ cat words.* > reconstituted_word_list.txt
Of course, you want to eliminate spaces, and lowercase the words all at the same time:
$ WORD_LIST=/home/username/public/wordlists/Big-File-Of-Words.txt
$ sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
The tr is the transformation command, and will change all uppercase to lower case. The split splits the files, and sed removes the spaces.
One of Unix's big strengths was its file handling ability. Splitting up big files into smaller pieces and reconstituting them was a common task. Maybe you had a big file, but a bunch of floppy disks that couldn't hold more than 100K per floppy. Maybe you were trying to use UUCP to copy these files over to another computer and there was a 10K limit on file transfer sizes. Maybe you were doing FTP by email, and the system couldn't handle files larger than 5K.
Anyway, I brought it up because it's probably an easier solution in your case than writing a Perl script. I am a big writer of Perl, and many times Perl can handle a task better and faster than shell scripts can. However, in this case, this is an easy task to handle in shell.
Here's a pure Perl solution. The problem is that you want to create files after every 100 lines.
To solve this, I have two loops. One is an infinite loop, and the other loops 100 times. Before I enter the inner loop, I create a file for writing, and write one word per line. When that inner loop ends, I close the file, increment my $output_file_num and then open another file for output.
A few changes:
I use use warnings; and use strict (which is included when you specify that you want Perl version 5.12.0 or greater).
Don't use use vars;. This is obsolete. If you have to use package variables, declare the variable with our instead of my. When should you use package variables? If you have to ask that question, you probably don't need package variables. 99.999% of the time, simply use my to declare a variable.
I use constant to define your word file. This makes it easy to move the file when needed.
My s/../../ not only removes beginning and ending spaces, but also lowercases my word for me. The ^\s*(.*?)\s*$ removes the entire line, but captures the word sans spaces at the beginning and end of the word. The .*? is like .*, but is non-greedy. It will match the minimum possible (which in this case does not include spaces at the end of the word).
Note I define a label INPUT_WORD_LIST. I use this to force my inner last to exit the outer loop.
I take advantage of the fact that $output_word_list_fh is defined only in the loop. Once I leave the loop, the file is automatically closed for me since $output_word_list_fh is out of scope.
And the program:
#!/usr/bin/env perl
use 5.12.0;
use warnings;
use autodie;
use constant WORD_FILE => "/home/username/public/wordlists/Big-File-Of-Words.txt";
open my $input_word_list_fh, "<", WORD_FILE;
my $output_file_num = 0;
INPUT_WORD_LIST:
for (;;) {
    open my $output_word_list_fh, ">", sprintf "%05d.txt", $output_file_num;
    for my $line (1..100) {
        my $word;
        if ( not $word = <$input_word_list_fh> ) {
            last INPUT_WORD_LIST;
        }
        chomp $word;
        $word =~ s/^\s*(.*?)\s*$/\L$1\E/;
        say {$output_word_list_fh} "$word";
    }
    close $output_word_list_fh;
    $output_file_num += 1;
}
close $input_word_list_fh;

Checking for Duplicates in array

What's going on:
I've ssh'd onto my localhost, run ls on the Desktop, and taken those items and put them into an array.
I hardcoded a short list of items and I am comparing them with a hash to see if anything is missing from the host (See if something from a is NOT in b, and let me know).
So after figuring that out, when I print out the "missing files" I get a bunch of duplicates (see below), not sure if that has to do with how the files are being checked in the loop, but I figured the best thing to do would be to just sort out the data and eliminate dupes.
When I do that, and print out the fixed data, only one file is printing, two are missing.
Any idea why?
#!/usr/bin/perl
my $hostname = $ARGV[0];
my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output = `ssh $hostname "cd Desktop; ls -a"`;
my %comparison;
for my $file (@hostFiles) {
    $comparison{$file} += 1;
}
for my $file (@output) {
    $comparison{$file} += 2;
}
for my $file (sort keys %comparison) {
    @missing = "$file\n" if $comparison{$file} == 1;
    #print "Extra file: $file\n" if $comparison{$file} == 2;
    print @missing;
}
my @checkedMissingFiles;
foreach my $var ( @missing ){
    if ( ! grep( /$var/, @checkedMissingFiles) ){
        push( @checkedMissingFiles, $var );
    }
}
print "\n\nThe missing Files without dups:\n @checkedMissingFiles\n";
Password:
awesomeness.txt ##This is what is printing after comparing the two arrays
awesomeness.txt
filecheck.pl
filecheck.pl
filecheck.pl
hostscript.pl
hostscript.pl
The missing Files without dups: ##what prints after weeding out duplicates
hostscript.pl
The perl way of doing this would be:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my %hostFiles = qw( filecheck.pl 1 hostscript.pl 1 awesomeness.txt 1);
# ssh + backticks + ls, not the greatest way to do this, but that's another Q
my @files = `ssh $ARGV[0] "ls -a ~/Desktop"`;
# get rid of the newlines
chomp @files;
# grep returns the matching elements of @files
my %existing = map { $_ => 1 } grep { exists($hostFiles{$_}) } @files;
print Dumper([grep { !exists($existing{$_})} keys %hostFiles]);
Data::Dumper is a utility module, I use it for debugging or demonstrative purposes.
If you want print the list you can do something like this:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    print grep { !exists($existing{$_})} keys %hostFiles;
}
$ORS is the output record separator (it's printed after any print) and $OFS is the output field separator which is printed between the print arguments. See perlvar. You can get away with not using "English", but the variable names will look uglier. The block and the local are so you don't have to save and restore the values of the special variables.
If you want to write to a file the result something like this would do:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    open F, ">host_$ARGV[0].log";
    print F grep { !exists($existing{$_})} keys %hostFiles;
    close F;
}
Of course, you can also do it the "classical" way: loop through the array and print each element:
open F, ">host_$ARGV[0].log";
for my $missing_file (grep { !exists($existing{$_})} keys %hostFiles) {
    use English;
    local $ORS = "\n";
    print F "File is missing: $missing_file";
}
close F;
This allows you to do more things with the file name, for example, you can SCP it over to the host.
It seems to me that looping over the 'required' list makes more sense - looping over the list of existing files isn't necessary unless you're looking for files that exist but aren't needed.
#!/usr/bin/perl
use strict;
use warnings;
my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output = `ssh $ARGV[0] "cd Desktop; ls -a"`;
chomp @output;
my @missingFiles;
foreach (@hostFiles) {
    push( @missingFiles, $_ ) unless $_ ~~ @output;
}
print join("\n", "Missing files: ", @missingFiles);
@missing = "$file\n" assigns the array @missing to contain a single element, "$file\n". It does this on every pass through the loop, leaving it with only the last missing file.
What you want is push(@missing, "$file\n").
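The difference is easy to see in isolation, with hypothetical file names standing in for the loop's data:

```perl
use strict;
use warnings;

my @assigned;
my @pushed;
for my $file ('a.txt', 'b.txt', 'c.txt') {
    @assigned = "$file\n";      # replaces the array's contents each time
    push @pushed, "$file\n";    # appends, keeping earlier elements
}
print scalar(@assigned), " vs ", scalar(@pushed), "\n";   # prints "1 vs 3"
```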
