Perl: STDOUT/the output of shell command to an array directly - arrays

I have to run a shell command, hive, from within a Perl script, so I use backticks (`...`).
Assuming the result of `hive ... ...` contains 100000000 lines and is 20GB size.
What I want to achieve is something like this:
@array = `hive ... ...`;
Do backticks automatically use "\n" as the separator, so that each line goes into its own element of @array?
The two ways I can think of are the following (but both have problems in this case):
$temp = `hive ... ...`;
@array = split ( "\n", $temp );
undef $temp;
The problem with this approach is that if the output of hive is too big, as in this case, $temp can't store the output, resulting in a segmentation fault and core dump.
OR
`hive ... ... 1>temp.txt`;
open ( FP, "<", "temp.txt" );
while (<FP>)
{
chomp;
push @array, $_;
}
close FP;
`rm temp.txt`;
But this way would be too slow, because it first writes the result to the hard disk.
Is there a way to write the output of a shell command directly to an array without using any 'temporary container'?
Many thanks for your help.

@array = `command`;
does, in fact, put each line of output from command into its own element of @array. There is no need to load the output into a scalar and split it yourself.
But 20GB of output stored in an array (and possibly 2-3 times that amount due to the way that Perl stores data) will still put an awful strain on your system.
The real solution to your problem is to stream the output of your command through an IO handle, and deal with one line at a time without having to load all of the output into memory at once. The way to do that is with Perl's open command:
open my $fh, "-|", "command";
or, in the older two-argument form,
open my $fh, "command |";
The -| filemode or the | appended to the command tells Perl to run an external command, and to make the output of that command available in the filehandle $fh.
Now iterate on the filehandle to receive one line of output at a time.
while (<$fh>) {
# one line of output is now in $_
do_something($_);
}
close $fh;
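As a minimal, self-contained sketch of the streaming approach: the child command here is only a stand-in (the running Perl interpreter, $^X, printing five lines) rather than hive, and count_lines is a hypothetical helper name. It uses the list form of the pipe open, which runs the command without going through a shell.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and count its output lines
# without ever holding more than one line in memory.
sub count_lines {
    my @cmd = @_;

    # List-form pipe open: runs @cmd directly, no shell involved.
    open my $fh, '-|', @cmd or die "Cannot run '$cmd[0]': $!";

    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count++;    # replace with real per-line processing
    }

    # close() on a pipe waits for the child and sets $?.
    close $fh or die "Command failed: exit status " . ($? >> 8);
    return $count;
}

# Stand-in for the real command: the current perl prints 5 lines.
my $n = count_lines($^X, '-e', 'print "line $_\n" for 1 .. 5');
print "Read $n lines\n";
```

Memory use stays flat no matter how many lines the command produces, because each line is discarded before the next one is read.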

Related

perl add line into text file

I am writing a script that modifies a text file by adding some text, indented with a tab, under a specific string in the file. I need help adding a new line with tab spacing after the matched string "apple" in the case below.
Example File:
apple
<tab_spacing>original text1
orange
<tab_spacing>original text2
Expected output:
apple
<tab_spacing>testing
<tab_spacing>original text1
orange
<tab_spacing>original text2
What I have tried:
use strict;
use warnings;
my $config="filename.txt";
open (CONFIG,"+<$config") or die "Fail to open config file $config\n";
while (<CONFIG>) {
chop;
if (($_ =~ /^$apple$/)){
print CONFIG "\n";
print CONFIG "testing\n";
}
}
close CONFIG;
We cannot simply "add" text to the middle of a file as attempted. A file is a sequence of bytes, and one cannot add or remove bytes (except at the end), only change them. So if we start writing in the middle of a file, we are changing the bytes there, overwriting what follows that place. Instead, we have to copy the rest of the text and write it back following the "addition," or copy the file, adding text in the process.
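A minimal sketch of the "copy the file, adding text in the process" approach, assuming the marker line is exactly "apple". The helper name copy_with_insert is hypothetical, and the demo runs on in-memory filehandles so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, writing $new_line
# right after every line that is exactly $marker.
sub copy_with_insert {
    my ($in, $out, $marker, $new_line) = @_;
    while (my $line = <$in>) {
        print {$out} $line;
        chomp(my $text = $line);
        print {$out} $new_line if $text eq $marker;
    }
    return;
}

# Demo on in-memory filehandles; real code would open the input
# file and a new output file, then rename the output over the input.
my $input  = "apple\n\toriginal text1\norange\n\toriginal text2\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";
copy_with_insert($in, $out, 'apple', "\ttesting\n");
close $in;
close $out;
print $output;
```

Since only one line is held at a time, this also works for files too large to read into memory.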
Yet another way is to read the whole file into a string, run a regex on it to change it, then write out the new string. Assuming that the file isn't too large for that:
perl -0777 -pe's{apple\n\K(\t)}{Added text\n$1}g' in.txt
The -0777 switch makes it read the whole file into a string ("slurp" it), available in $_, to which the regex is bound by default. The \K, which works like a lookbehind, drops everything matched before it from the match, so that text is not replaced and we don't have to (capture and) put it back. With the /g modifier the regex keeps going through the whole string, finding and changing all occurrences of the pattern.
This prints the changed file to the screen, which can be saved in a new file by redirecting it
perl -0777 -pe'...' in.txt > out.txt
Or, one can change the input file "in place" with -i
perl -0777 -i.bak -pe'...' in.txt
The .bak makes it save the original with .bak extension. See switches in perlrun.
Another way is to use a lookahead for what follows (the tab) so that we don't have to capture and put it back
perl -0777 -pe's{apple\n\K(?=\t)}{Added text\n}g' in.txt
All of these produce the desired change.
Note on that tab ("tab_spacing")
The regex above assumes a tab character at the beginning of the line following the line with apple. When we say "tab" we mean one (tab) character.
But there are many reasons why there may in fact not be a tab character, even if it looks just like there is one. An example: all tabs may be automatically replaced by spaces by an editor.
So it may be safer to use \s+ (one or more whitespace characters) instead of \t in the regex
s{apple\n\K(\s+)}{Added text\n$1}g
or
s{apple\n\K(?=\s+)}{Added text\n}g
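To see the difference between the \t and \s+ versions, here is a small sketch that applies both lookahead regexes to text indented with four spaces instead of a tab. The helper names add_after_tab and add_after_whitespace are made up for this illustration.

```perl
use strict;
use warnings;

# Two versions of the same file: one indented with a real tab,
# one indented with four spaces (as an editor might save it).
my $with_tab    = "apple\n\toriginal text1\n";
my $with_spaces = "apple\n    original text1\n";

# Hypothetical helpers wrapping the two regexes discussed above.
sub add_after_tab {
    my ($text) = @_;
    $text =~ s{apple\n\K(?=\t)}{Added text\n}g;
    return $text;
}

sub add_after_whitespace {
    my ($text) = @_;
    $text =~ s{apple\n\K(?=\s+)}{Added text\n}g;
    return $text;
}

# Only the \s+ version changes the space-indented text.
print "tab regex changed the space-indented text\n"
    if add_after_tab($with_spaces) ne $with_spaces;
print "\\s+ regex changed the space-indented text\n"
    if add_after_whitespace($with_spaces) ne $with_spaces;
```

Both versions change the tab-indented text; only the \s+ version also handles the case where the tab was silently converted to spaces.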
If this is to be done inside of an existing larger Perl program (and not as a command-line program, a "one-liner" as above), one way is
use Path::Tiny; # path()
my $file_content = path($file)->slurp; # read the file into a string
# Now use a regex; all discussion above applies
$file_content =~ s{apple\n\K(?=\t)}{Added text\n}g;
# Print out $file_content, to be redirected etc. Or write to a file
path($new_file)->spew($file_content);
I use the library Path::Tiny to "slurp" the file into a string, and its spew method to write $file_content to a new file. It needs to be installed since it is not a core module (it doesn't usually come installed with Perl). If that is a problem for some reason, here is an idiom of sorts for slurping without any libraries
my $file_content = do {
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
or even
my $file_content = do { local (@ARGV, $/) = $file; <> };
(see this post for some explanation and references)
Some pretty weird stuff in your code, to be honest:
Reading from and writing to a file at the same time is hard. Your code, for example, would just write all over the existing data in the file
Using a bareword filehandle (CONFIG) instead of a lexical variable and two-arg open() instead of the three-arg version (open my $config_fh, '+<', $config) makes me think you're using some pretty old Perl tutorials
Using chop() instead of chomp() makes me think you're using some ancient Perl tutorials
You seem to have an extra $ in your regex - ^$apple$ should probably be ^apple$
Also, Tie::File has been included with Perl's standard library for over twenty years and would make this task far easier.
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
tie my @file, 'Tie::File', 'filename.txt' or die $!;
# Walk backwards so that inserted lines don't shift the indexes still to be visited
for (reverse 0 .. $#file) {
if ($file[$_] eq 'apple') {
splice @file, $_ + 1, 0, "\ttesting\n";
}
}
It's not entirely clear what you mean by "tab spacing", but you might be looking for:
perl -pE 'm/^(\t*)/; say "${1}testing" if $a; $a = /apple/' filename.txt
I suspect you actually want \s instead of \t, but YMMV. Basically, on each line of input, you match the leading whitespace and then print a line with that whitespace and the string 'testing' if the previous line matched.
To write it verbosely:
#!/usr/bin/env perl
use 5.12.0;
use strict;
use warnings;
my $n = 'filename.txt';
open my $f, '<', $n or die "$n: $!\n";
while(<$f>){
m/^(\t*)/; # possibly \s is preferred over \t
say "${1}testing" if $a;
$a = /apple/;
print;
}

How can I overwrite file after replace the word?

I want to replace the text in the file and overwrite file.
use strict;
use warnings;
my ($str1, $str2, $i, $all_line);
$str1 = "AAA";
$str2 = "bbb";
open(RP, "+>", "testfile") ;
$all_line = $_;
$i = 0;
while(<RP>) {
while(/$str1/) {
$i++;
}
s/$str1/$str2/g;
print RP $_;
}
close(RP);
A normal process is to read the file line by line and write out each line, changed as/if needed, to a new file. Once that's all done, rename that new file, as atomically as possible, so that it overwrites the original.
Here is an example of a library that does all that for us, Path::Tiny
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
my $file = shift || die "Usage: $0 file\n";
say "File to process:";
say path($file)->slurp;
# NOTE: This CHANGES THE INPUT FILE
#
# Process each line: upper-case a letter after .
path($file)->edit_lines( sub { s/\.\s+\K([a-z])/\U$1/g } );
say "File now:";
say path($file)->slurp;
This upper-cases a letter following a dot (period), after some spaces, on each line where it is found and copies all other lines unchanged. (It's just an example, not a proper linguistic fix.)
Note: the input file is edited in-place so it will have been changed after this is done.
This capability was introduced in the module's version 0.077, of 2016-02-10. (For reference, Perl version 5.24 came out in 2016-05. So with the system Perl 5.24 or newer, a Path::Tiny installed from an OS package or as a default CPAN version should have this method.)
Perl has a built-in in-place editing facility: The -i command line switch. You can access the functionality of this switch via $^I.
use strict;
use warnings;
my $str1 = "AAA";
my $str2 = "bbb";
local $^I = ''; # Same as `perl -i`. For `-i~`, assign `"~"`.
local @ARGV = "testfile";
while (<>) {
s/\Q$str1/$str2/g;
print;
}
I was not going to leave an answer, but I discovered when leaving a comment that I did have some feedback for you, so here goes.
open(RP, "+>", "testfile") ;
The mode "+>" will truncate your file, deleting all its content. This is described in the documentation for open:
You can put a + in front of the > or < to indicate that you want both read and write access to the file; thus +< is almost always preferred for read/write updates--the +> mode would clobber the file first. You can't usually use either read-write mode for updating textfiles, since they have variable-length records. See the -i switch in perlrun for a better approach.
So naturally, you can't read from the file if you first delete its content. They mention here the -i switch, which is described like this in perl -h:
-i[extension] edit <> files in place (makes backup if extension supplied)
This is what ikegami's answer describes, only in his case it is done from within a program file rather than on the command line.
But, using the + mode for both reading and writing is not really a good way to do it. Basically it becomes difficult to print where you want to print. The recommended way is to instead read from one file, and then print to another. After the editing is done, you can rename and delete files as required. And this is exactly what the -i switch does for you. It is a predefined functionality of Perl. Read more about it in perldoc perlrun.
Next, you should use a lexical file handle. E.g. my $fh, instead of a global. And you should also check the return value from the open, to make sure there was not an error. Which gives us:
open my $fh, "<", "testfile" or die "Cannot open 'testfile': $!";
Usually if open fails, you want the program to die, and report the reason it failed. The error is in the $! variable.
Another thing to note is that you should not declare all your variables at the top of the file. It is good that you use use strict; use warnings, Perl code should never be written without them, but this is not how you handle it. You declare your variable as close to the place you use the variable as possible, and in the smallest scope possible. With a my declared variable, that is the nearest enclosing block { ... }. This will make it easy to trace your variable in bigger programs, and it will encapsulate your code and protect your variable.
In your case, you would simply put the my before all the variables, like so:
my $str1 = "AAA";
my $str2 = "bbb";
my $all_line = $_;
my $i = 0;
Note that $_ will be empty/undefined there, so that assignment is kind of pointless. If your intent was to use $all_line as the loop variable, you would do:
while (my $all_line = <$fh>) {
Note that this variable is declared in the smallest scope possible, and we are using a lexical file handle $fh.
Another important note is that your search string can contain regex metacharacters. Sometimes you want them to be treated as such, for example:
my $str1 = "a+"; # match one or more 'a'
Sometimes you don't want that:
my $str1 = "google.com"; # '.' is meant as a period, not the "match anything" character
I will assume that you most often do not want that, in which case you should use the escape sequence \Q ... \E which disables regex meta characters inside it.
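A quick sketch of what \Q ... \E changes, using the "google.com" string from above. The test string "googleXcom" is made up for the illustration.

```perl
use strict;
use warnings;

my $str1 = "google.com";

# Without \Q...\E the '.' matches any character.
my $unquoted = "googleXcom" =~ /$str1/     ? 1 : 0;
# With \Q...\E the '.' only matches a literal period.
my $quoted   = "googleXcom" =~ /\Q$str1\E/ ? 1 : 0;
my $literal  = "google.com" =~ /\Q$str1\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted, literal: $literal\n";
# prints "unquoted: 1, quoted: 0, literal: 1"
```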
So what do we get if we put all this together? Well, you might get something like this:
use strict;
use warnings;
my $str1 = "AAA";
my $str2 = "bbb";
my $filename = shift || "testfile"; # 'testfile', or whatever the program argument is
open my $fh_in, "<", $filename or die "Cannot open '$filename': $!";
open my $fh_out, ">", "$filename.out" or die "Cannot open '$filename.out': $!";
while (<$fh_in>) { # read from input file...
s/\Q$str1\E/$str2/g; # perform substitution...
print $fh_out $_; # print to output file
}
close $fh_in;
close $fh_out;
After this finishes, you may choose to rename the files and delete one or the other. This is basically the same procedure as using the -i switch, except here we do it explicitly.
rename $filename, "$filename.bak"; # save backup of input file in .bak extension
rename "$filename.out", $filename; # clobber the input file
Renaming files is sometimes also facilitated by the File::Copy module, which is a core module.
With all this said, you can replace all your code with this:
perl -i -pe's/AAA/bbb/g' testfile
And this is the power of the -i switch.

Save STDIN to file in Perl?

I'm trying to use Perl to:
wait until at least 1 byte arrives on STDIN
read all of STDIN and save it to a file as binary (to be UTF-8 compatible)
My attempts:
# read STDIN into a string/buffer and save it (maybe this only reads one line):
echo hello | perl -e 'open (fh, ">", "my_filename.txt"); print fh <STDIN>';
# or:
echo hello | perl -e 'open STDIN, ">", "my_filename.txt"'
# ideally be able to specify the filename:
echo hello | perl -e 'open STDIN, ">", $ARGV[0]' my_filename.txt
I can't use the shell to do this with echo hello > my_filename.txt because of an ancient bug/feature of shells where all file handles are opened simultaneously instead of sequentially based on some dependency logic. So my plan is to wait until there are bytes waiting on STDIN before opening the destination file.
Unfortunately I'm in a time crunch and don't have time to relearn the syntax of Perl. I think where I'm going wrong is that I'm trying to refer to STDIN as a new variable here instead of the actual input stream. I've looked at various other buffering solutions but unfortunately they're all inadequate in some way (don't actually work, aren't cross-platform, require additional executables to be installed, etc).
Bonus points if you perform the redirection either directly or by looping over chunks to reduce memory usage. Knowing how to read from a file and write to STDOUT would also be helpful. This may also be possible in something like awk, but note that I'm avoiding sed due to another ancient poor choice of syntax between GNU sed and BSD sed that often prevents using a single command on both platforms without tweaks. So Perl seems to be the way to go.
I can't be sure, but I think you want to do something like
cat foo | some_operation > foo
If so, the simple solution is
cat foo | some_operation | sponge foo
But what you directly asked for can be achieved as follows:
perl -e'
@ARGV or die("usage\n");
my $qfn = shift;
my $line = <>;
open(STDOUT, ">", $qfn)
or die("open $qfn: $!\n");
print $line;
print while <>;
' file.out
We can take advantage of -p.
perl -pe'
BEGIN {
@ARGV or die("usage\n");
$qfn = shift(@ARGV);
}
if ($. == 1) {
open(STDOUT, ">", $qfn)
or die("open $qfn: $!\n");
}
' file.out
We can throw in -s.
perl -spe'
BEGIN { defined($o) or die("usage\n"); }
if ($. == 1) {
open(STDOUT, ">", $o)
or die("open $o: $!\n");
}
' -- -o=file.out
Finally, we can get rid of error checking.
perl -spe'open(STDOUT, ">", $o) if $. == 1' -- -o=file.out
Bonus: If you'd like to read from a file instead of STDIN, all of the above are capable of doing so. Just provide the name of the file from which to read as a subsequent argument. You can even provide more than one name.
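For the "looping over chunks" part of the question, here is a minimal sketch of a byte-for-byte copy between two filehandles. The helper name copy_chunks and the chunk size are assumptions for the illustration; the demo uses in-memory handles so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: copy everything from $in to $out in fixed-size
# chunks, treating the data as raw bytes.
sub copy_chunks {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;
    binmode $in;
    binmode $out;

    my $total = 0;
    while (1) {
        my $got = read($in, my $buf, $chunk_size);
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # end of input
        print {$out} $buf or die "write failed: $!";
        $total += $got;
    }
    return $total;
}

# Demo on in-memory handles; for the real task pass \*STDIN and a
# handle opened on the destination file (or a file handle and \*STDOUT).
my $data = "hello\n" x 3;
my $copy = '';
open my $in,  '<', \$data or die "Cannot open input: $!";
open my $out, '>', \$copy or die "Cannot open output: $!";
my $bytes = copy_chunks($in, $out, 4);
close $out;
print "copied $bytes bytes\n";
```

At most one chunk is held in memory at a time, and binmode keeps the bytes untouched, so UTF-8 data passes through unchanged.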

using perl array as input to bash bedtools command

I'm wondering if it is possible to use a perl array as the input to a program called bedtools ( http://bedtools.readthedocs.org/en/latest/ )
The array is itself generated by bedtools via the backticks method in perl. When I try to use the perl array in another bedtools bash command it complains that the argument list is too long because it seems to treat each word or number in the array as a separate argument.
Example code:
my @constit_super = `bedtools intersect -wa -a $enhancers -b $super_enhancer`;
that works fine and can be viewed by:
print @constit_super
which looks like this onscreen:
chr10 73629894 73634938
chr10 73636240 73639574
chr10 73639726 73657218
but then if I try to use this array in bedtools again e.g.
my $bedtools = `bedtools merge -i @constit_super`;
then I get this error message:
Can't exec "/bin/sh": Argument list too long
Is there any way to use this perl array in bedtools?
Many thanks
27/9/14: Thanks for the info on doing it via a file. However, sorry to be a pain, but I would really like to do this without writing a file if possible.
I haven't tested this but I think it would work.
bedtools is expecting one argument with the -i flag, the name of a .bed file. This was in the docs. You need to write your array to a file and then input it into the bedtools merge command.
open(my $fh, '>', "input.bed") or die $!;
print $fh join("", @constit_super);
close $fh;
Then you can sort it with this command from the docs:
`sort -k1,1 -k2,2n input.bed > input.sorted.bed`;
Finally, you can run your merge command.
my $bedtools = `bedtools merge -i input.sorted.bed`;
Hopefully this sets you on the right track.
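A variation on the above, as a sketch: File::Temp (a core module) can create the intermediate file, so that no fixed name like input.bed is left behind. The child command below is only a stand-in (the current Perl counting the file's lines), not bedtools, and the array contents are the sample lines from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Lines as returned by backticks in list context (newlines included).
my @constit_super = (
    "chr10\t73629894\t73634938\n",
    "chr10\t73636240\t73639574\n",
    "chr10\t73639726\t73657218\n",
);

# Write the array to a temporary file that is removed at program exit.
my ($fh, $tmp_name) = tempfile(SUFFIX => '.bed', UNLINK => 1);
print {$fh} @constit_super;
close $fh or die "Cannot close $tmp_name: $!";

# Stand-in for the real tool: the current perl counts the file's lines.
# Only the file name is passed, so the argument list stays short.
open my $out, '-|', $^X, '-ne', 'END { print $. }', $tmp_name
    or die "Cannot run command: $!";
my $line_count = <$out>;
close $out;
print "child saw $line_count lines\n";
```

The point is that the command line carries only one short file name, however many lines the array holds.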

basic perl: regex statement not working in perl 5.10.1 machine but works in 5.18?

I am testing the following piece of code, which compares a list of keywords in an array to a string and searches for matches. The problem is only in the regex statement at the bottom of the code: it doesn't produce any results/matches on the Linux server at school, which I am supposed to run/test it on and which uses Perl 5.10.1. It did seem to run fine on my local Windows machine running Strawberry Perl, for some reason.
Is there any other way the regex statement below can be modified to achieve the same result on older versions of Perl (or have I made a mistake somewhere else)?
use strict;
use warnings;
use Data::Dumper;
my $keyword_file = "keywords.txt"; #converts this to an array @keywords
my $myString = "this string will be loger later; print";
#read keywords
my @keywords;
open (FH, "$keyword_file") or die "Can't open $keyword_file for read: $!";
while (<FH>) {
chomp;
push (@keywords, $_);
}
close FH or die "Cannot close $keyword_file: $!";
#compare keywords and file string
foreach (@keywords)
{
if($myString =~ /$_/){ # having problems here <<***************
print "foud match";
}
}
The keyword file is just a simple text file which contains single words like print, while, exit, etc., one per line.
keywords.txt:
exit
ls
print
grep
You created the file on Windows. Windows uses CR LF as line endings. You then transferred it to a unix system without converting the line endings to unix line endings (LF). That means that $_ contains exit plus a carriage return rather than just exit.
Fix the line endings (e.g. using dos2unix), or handle Windows files on unix systems (replace chomp; with s/\s+\z//;).
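A small sketch of the effect, assuming a keyword line that still carries its Windows line ending when read on the unix machine. The sample string is a stand-in for the one in the question.

```perl
use strict;
use warnings;

my $myString = "this string will be longer later; print";

# A line as read on unix from a file with Windows (CR LF) endings.
my $line = "print\r\n";

# chomp removes only the trailing LF, so the CR stays in the keyword.
my $chomped = $line;
chomp $chomped;

# s/\s+\z// removes all trailing whitespace, including the CR.
(my $stripped = $line) =~ s/\s+\z//;

my $chomped_matches  = $myString =~ /$chomped/  ? 1 : 0;
my $stripped_matches = $myString =~ /$stripped/ ? 1 : 0;
print "after chomp: $chomped_matches, after s/\\s+\\z//: $stripped_matches\n";
# prints "after chomp: 0, after s/\s+\z//: 1"
```

The Perl version is not the cause: the same file, with the same line endings, would fail to match on 5.18 on unix as well.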

Resources