Split tab-delimited data into arrays

I need to split data whose columns are separated by tab delimiters, for example (row 1): abc<tab>def<tab>ghi, so that each column is put in a different array.
Can I achieve this using split('\t')?

I'm not entirely sure what you want to do as you provide no reference code. However, this will at least crudely do what I think you want. Take an input file (better to read it in on the assumption that it's longer than the one provided) and, for each line of the input, split on \t and push the first element onto @group1 and so on. You should be able to print out each element from there...
#!/usr/bin/perl
use warnings;
use strict;

my $file = 'in.txt';
open my $infile, '<', $file or die "Can't read from $file: $!";

my (@group1, @group2, @group3);

while (<$infile>) {
    chomp;
    my @cols = split /\t/;
    push @group1, $cols[0];
    push @group2, $cols[1];
    push @group3, $cols[2];
}

print "$group1[0]\n";
print "$group2[0]\n";
print "$group3[0]\n";
Output:
abc
def
ghi

Related

How to read a .txt file and store it into an array

I know this is a fairly simple question, but I cannot figure out how to store all of the values in my array the way I want to.
Here is a small portion what the .txt file looks like:
0 A R N D
A 2 -2 0 0
R -2 6 0 -1
N 0 0 2 2
D 0 -1 2 4
Each value is delimited by either two spaces - if the next value is positive - or a space and a '-' - if the next value is negative
Here is the code:
use strict;
use warnings;

open my $infile, '<', 'PAM250.txt' or die $!;

my $line;
my @array;

while ($line = <$infile>)
{
    $line =~ /^$/ and die "Blank line detected at $.\n";
    $line =~ /^#/ and next; # skips the commented lines at the beginning
    @array = $line;
    print "@array"; # Prints the array after each line is read
};

print "\n\n@array"; # only prints the last line of the array?
I understand that #array only holds the last line that was passed to it. Is there a way where I can get #array to hold all of the lines?
You are looking for push.
push @array, $line;
You undoubtedly want to precede this with chomp to snip any newlines first.
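For context, here is a minimal sketch of how the loop might look with chomp and push in place (reusing the PAM250.txt name and the comment-skipping check from the question):
use strict;
use warnings;

open my $infile, '<', 'PAM250.txt' or die $!;

my @array;
while (my $line = <$infile>) {
    next if $line =~ /^#/;   # skip the commented lines, as in the question
    chomp $line;             # snip the trailing newline before storing
    push @array, $line;      # append this line to the array
}
close $infile;

print "$_\n" for @array;     # every line of the file is now its own element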
If the file is small compared to the available memory on your machine, you can simply use the method below to read the content of the file into an array:
open my $infile, '<', 'PAM250.txt' or die $!;
my #array = <$infile>;
close $infile;
If you are going to read a very large file, it is better to read it line by line as you are doing, but use push to add each line to the end of the array:
push(@array, $line);
I suggest you also read about some more array manipulation functions in Perl.
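For reference, a quick standalone sketch of the most common array-manipulation functions (the values here are made up just for illustration):
my @array = ('A', 'R', 'N');

push    @array, 'D';        # add to the end:       A R N D
unshift @array, '0';        # add to the front:     0 A R N D
my $last  = pop   @array;   # remove from the end   -> 'D'
my $first = shift @array;   # remove from the front -> '0'
splice @array, 1, 1;        # delete one element at index 1: A N

print "@array\n";           # prints "A N"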
It's unclear what you want to achieve.
Is every line an element of your array?
Or is every line an array within your array, with your "words" as the elements of that inner array?
Anyhow.
Here is how you can achieve both:
use strict;
use warnings;
use Data::Dumper;

# Read all lines into your array, after removing the \n
my @array = map { chomp; $_ } <>;

# show it
print Dumper \@array;

# Make each line an array so that you have an array of arrays
$_ = [ split ] foreach @array;

# show it
print Dumper \@array;
Try this...
sub room {
    my $result = "";
    open(my $fh, '<', $_[0]) or die "Can't open $_[0]: $!";
    while (<$fh>) { $result .= $_; }
    close($fh);
    return $result;
}
This gives you the basic functionality without many words. The earlier suggestion carries the risk of failing on large files; the fastest safe way is this. Call it as you like...
my @array = &room('/etc/passwd');
print room('/etc/passwd');
You can shorten or rename it as you see fit.
Note that this approach replaces push with simplicity: a text file contains line breaks, and where the traditional push removes the line break and pushes just the line, this builds the result as a single string that keeps the line breaks.

perl split delimiter from file line by line

I have a text file named 'dataexample' with multiple line like this:
a|30|40
b|50|70
then I split the delimiter with this code:
open(FILE, 'dataexample') or die "File not exist";
while (<FILE>) {
    my @record = split(/\|/, $_);
    print "$record[0]";
}
close FILE;
when I print "$record[0]" , this is what I got:
ab
what I expect :
a 30 40
so when I do print "$record[0][0]" I expect the output to be: a
Where did I go wrong?
Your loop while ( <FILE> ) { ... } reads a single line at a time from the file handle and puts it into $_.
my @record = split(/\|/, $_) splits that line on pipe characters |, so since the first line is "a|30|40\n", @record will now be 'a', '30', "40\n". The newline read from the file remains, and you should use chomp to remove it if you don't want it there.
So now $record[0] is a, which you print, and then go on to read the next line in the file, setting @record to 'b', '50', "70\n" this time. Now $record[0] is b, which you also print, showing ab on the console.
You've now reached the end of the file, so the while loop terminates.
It sounds like you're expecting a two-dimensional array. You can do that by pushing each array onto a main array each time you read a record, like this
use strict;
use warnings 'all';

open my $fh, '<', 'dataexample' or die qq{Unable to open "dataexample" for input: $!};

my @data;
while ( <$fh> ) {
    chomp;
    my @record = split /\|/;
    push @data, \@record;
}

print "@{$data[0]}\n";
print "$data[0][0]\n";
output
a 30 40
a
Or, more concisely, like this, which produces exactly the same result but may be a little advanced for you
use strict;
use warnings 'all';

open my $fh, '<', 'dataexample' or die qq{Unable to open "dataexample" for input: $!};

my @data = map { chomp; [ split /\|/ ] } <$fh>;

print "@{$data[0]}\n";
print "$data[0][0]\n";
Some points to know about your own code
You must always use strict and use warnings 'all' at the top of every Perl program you write. It's a measure that will uncover many simple mistakes that you may not otherwise notice
You should use lexical filehandles together with the three-parameter form of open. And an open may fail for many reasons other than the file not existing, so you should include the built-in $! variable in your die string to say why it failed
Don't forget to chomp each record read from a file, unless you want to keep the trailing newline or it doesn't matter to you
You will be able to write more concise code if you get used to using the default variable $_. For instance, the second parameter to split is $_ by default, so split(/\|/, $_) may be written as just split /\|/
You can use Data::Dumper to display the contents of your variables, which will help you to debug your code. Data::Dump is superior, but it isn't a core module so you will probably have to install it before you can use it in your code
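To illustrate that last point, here is a minimal sketch of Data::Dumper on a nested structure like the one built above (the values are just the sample data from the question, and the exact output formatting may vary):
use strict;
use warnings;
use Data::Dumper;

my @data = ( [ 'a', 30, 40 ], [ 'b', 50, 70 ] );
print Dumper \@data;

# Prints something along the lines of:
# $VAR1 = [
#           [
#             'a',
#             30,
#             40
#           ],
#           [
#             'b',
#             50,
#             70
#           ]
#         ];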
You have to use
print "$record[1]";
print "$record[2]";
as they are stored at consecutive indexes.
Or, if you want to print the entire thing, you can just do
print "@record\n";
You are printing the value at the first index in the array each time through the loop, and without a newline. So you get the first value from each line, right next to each other on the same line, thus ab.
Print the whole array, inside quotes and with a newline. Here is your program, changed a bit:
use strict;
use warnings;

my $file = 'dataexample';
open my $fh, '<', $file or die "Error opening $file: $!";

while (<$fh>) {
    chomp;
    my @record = split(/\|/, $_);
    print "@record\n";
}
close $fh;
With the quotes the elements are printed with spaces added between them so you get
a 30 40
b 50 70
If you print without quotes the elements get printed without extra spaces, so
this
print @record, "\n";
over the whole loop prints
a3040
b5070
If you don't have the new line "\n" either, it is all printed on one line so this
print @record;
altogether prints
a3040b5070
As for $record[0][0], this is not valid for the array you have. This would print from a two-dimensional array. Take, for example
my @data = ( [1.1, 2.2], [10, 20] );
This array @data has at its first index a reference to an array -- more precisely, an anonymous array [1.1, 2.2]. Its second element is an anonymous array [10, 20]. So $data[0][0] is: the first element of @data (so the first of the two anonymous arrays inside), and then the first element of that array, thus 1.1. Likewise $data[1][1] is 20.
Thanks to Sobrique for the comment.
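Here is a small sketch of the equivalent ways to reach those elements, in case the syntax is unfamiliar:
my @data = ( [1.1, 2.2], [10, 20] );

print $data[0][0], "\n";     # 1.1 -- implicit dereference between the indices
print $data[0]->[0], "\n";   # 1.1 -- the same access with an explicit arrow
print "@{ $data[1] }\n";     # 10 20 -- dereference and interpolate the whole inner array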
But you don't have this in your program. When you split data into an array
while (<FILE>) {
    my @record = split(/\|/, $_);
    # ...
}
it creates a new array named @record every time through the loop. So @record is a normal array, not two-dimensional. For such an array the syntax $record[0][0] doesn't mean much.
I think you're trying to create a 2d array, whereby each element contains all the pipe delimited items from each line of your input:
my @record;
while (<DATA>) {
    chomp;
    my @split = split(/\|/);
    push @record, [@split];
}
print "@{$record[0]}\n";
a 30 40
$record[0] has the contents of column 1 - 'a' on the first iteration of the loop, 'b' on the second. $record[1] has column 2, and so on. You put the print statement print "$record[0]" in the loop, so you get 'a' printed in the first iteration and 'b' in the second.
To get what you wanted, you need to replace your print statement with:
print join " ", @record, "\n";

Modify element in array. Deleting part of element in array. Perl

I need to get just the title of these songs out of a text file that has all of its info. The text file looks like this.
TRMMCAU128F9332597<SEP>SOEEWIZ12AB0182B09<SEP>YGGDRASIL<SEP>Beyond the Borders of Sanity
TRMMCCS12903CBEA4A<SEP>SOARHKB12AB0189EEA<SEP>Illegal Substance<SEP>Microphone Check
So the title would be the "Beyond the Borders of Sanity" and "Microphone Check"
I cannot figure out how to delete all that stuff before it. Here is the code I have so far:
# Checks for the argument, fail if none given
if(songs.txt != 0) {
    print STDERR "You must specify the file name as the argument.\n";
    exit 4;
}

# Opens the file and assign it to handle INFILE
open(INFILE, 'songs.txt') or die "Cannot open songs.txt: $!.\n";

@data = <INFILE>;
my @lines = map {$_ =~ /^T/ ? ($_ => 1) : ()} @data;

# This loops through each line of the file
#while($line = <INFILE>) {
#    chomp;
#    print $line;
#    print @data;
#}

# Close the file handle
close INFILE;
print @lines;
It outputs this:
1TRMMCAU128F9332597<SEP>SOEEWIZ12AB0182B09<SEP>YGGDRASIL<SEP>Beyond the Borders of Sanity1
I realize the 1's don't do anything; I was just playing around with it. Any help is greatly appreciated. Thanks.
Use the split function:
@songs = map { chomp; (split /<SEP>/)[3] } @data;
Assuming <SEP> is literally in the file and you want the fourth delimited field, as it appears from the sample data.
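Put into a complete (if minimal) script, that approach might look like this; the songs.txt name is taken from the question:
use strict;
use warnings;

open my $in, '<', 'songs.txt' or die "Cannot open songs.txt: $!";

while (my $line = <$in>) {
    chomp $line;
    my @fields = split /<SEP>/, $line;   # split on the literal <SEP> delimiter
    print "$fields[3]\n";                # the fourth field is the song title
}
close $in;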
Your data looks like data from the Million Song Dataset, which uses a literal <SEP> as the field delimiter. To get the last field--the song's title--you can do the following:
use strict;
use warnings;
@ARGV or die "You must specify the file name as the argument.\n";

while (<>) {
    print $1 if /([^>]+)$/;
}
Usage: perl script.pl songs.txt [>outFile.txt]
The last, optional parameter directs output to a file.
Output on your dataset:
Beyond the Borders of Sanity
Microphone Check
The regex matches all characters from the end of the line that are not >, and captures them. If the match is successful, the capture (stored in $1) is printed.
Hope this helps!

Perl push function gives index values instead of array elements

I am reading a text file named, mention-freq, which has data in the following format:
1
1
13
2
I want to read the lines and store the values in an array like this: @a = (1, 1, 13, 2). The Perl push function gives the index values/line numbers, i.e., 1, 2, 3, 4, instead of my desired output. Could you please point out the error? Here is what I have done:
use strict;
use warnings;
open(FH, "<mention-freq") || die "$!";
my @a;
my $line;

while ($line = <FH>)
{
    $line =~ s/\n//;
    push @a, $line;
    print @a."\n";
}
close FH;
The bug is that you are printing the concatenation of @a and a newline. When you concatenate, that forces scalar context. The scalar sense of an array is not its contents but rather its element count.
You just want
print "#a\n";
instead.
Also, while it will not affect your code here, the normal way to remove the record terminator read in by the <> readline operator is using chomp:
chomp $line;
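A tiny sketch of the difference, assuming @a already holds the four values from the file:
my @a = (1, 1, 13, 2);

print @a . "\n";         # concatenation forces scalar context: prints 4
print scalar(@a), "\n";  # the same count, asked for explicitly: prints 4
print "@a\n";            # interpolation keeps the contents:     prints 1 1 13 2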

Perl - Open large txt file on server and create / save into smaller files of 100 lines each

I am trying to do this:
I FTP a large file of single words (~144,000 and one word per line)
I need to open the uploaded file and create files with at most 100 lines each,
one word per line (01.txt, 02.txt, etc.).
I would like the processed 100 to be REMOVED from the original file
AFTER the file of 100 is created.
The server is shared but, I can install modules if needed.
Now, my code below is very crude as my knowledge is VERY limited. One problem is opening the whole file into an array: I assume the shared server does not have enough memory to open such a large file and read it into memory all at once. I just want the first 100 lines. Below is just code that opens a file small enough to be loaded and gets 100 lines into an array. Nothing else. I typed it quickly, so it probably has several issues, but it shows my limited knowledge and my need for help.
use vars qw($Word @Words $IN);

my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
my $cnt = '0';

open $IN, '<', "$PathToFile" or die $!;
while (<$IN>) {
    chomp;
    $Word = $_;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    if ($cnt <= 99){
        push(@Words, $Word);
    }
    $cnt++;
}
close $IN;
Thanks so much.
Okay, I am trying to implement the code below:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
    unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print CGI::header();
my $WORD_LIST='/home/username/public/wordlists/Big-File-Of-Words.txt';
sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
print 'Done';
1;
But I get:
syntax error at split-up-big-file.pl line 12, near "sed 's/ *//g'"
Can't find string terminator "'" anywhere before EOF at split-up-big-file.pl line 12.
FINALLY:
Well I figured out a quick solution that works. Not pretty:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
    unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use diagnostics;
print CGI::header();
my $sourcefile = '/home/username/public_html/test/bigfile.txt';
my $rowlimit = 100;
my $cnt= '1';
open(IN, $sourcefile) or die "Failed to open $sourcefile";
my $outrecno = 1;
while(<IN>) {
    if($outrecno == 1) {
        my $filename= $cnt.'.txt';
        open OUT, ">$filename" or die "Failed to create $filename";
        $cnt++;
    }
    print OUT $_;
    if($outrecno++ == $rowlimit) {
        $outrecno = 1;
        close OUT;
    }
}
close IN;
I found enough info here to get me going. Thanks...
Here is a solution based on a slight modification of your code that should work approximately the way you want it.
It loops through all the lines of the input file and for every 100th line it will write the word list of the words encountered since the last write (or the beginning). The eof($IN) check is to catch the remaining lines if they are less than 100.
use strict;
use warnings;

my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';

open my $IN, '<', "$PathToFile" or die $!;

my $cnt = 0;
my $cnt_file = 0;
my @Words;

while ( my $Word = <$IN> ) {
    chomp $Word;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    push(@Words, $Word);
    if ( !(++$cnt % 100) || eof($IN) ) {
        $cnt_file++;
        open my $out_100, '>', "file_$cnt_file.txt" or die $!;
        print $out_100 join("\n", @Words), "\n";
        close $out_100;
        @Words = ();
    }
}
There's a non-Perl solution that you might find interesting...
$ split -l 100 -a6 /home/username/public/wordlists/Big-File-Of-Words.txt words.
This will split your big file of words into a bunch of files with no more than 100 lines each. The file name will start with words., and the suffix will range from aaaaaa to zzzzzz. Thus, you'll have words.aaaaaa, words.aaaaab, words.aaaaac, etc. You can then recombine all of these files back into your word list like this:
$ cat words.* > reconstituted_word_list.txt
Of course, you want to eliminate spaces, and lowercase the words all at the same time:
$ WORD_LIST=/home/username/public/wordlists/Big-File-Of-Words.txt
$ sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
The tr is the transformation command, and will change all uppercase to lower case. The split splits the files, and sed removes the spaces.
One of Unix's big strengths was its file handling ability. Splitting up big files into smaller pieces and reconstituting them was a common task. Maybe you had a big file, but a bunch of floppy disks that couldn't hold more than 100K per floppy. Maybe you were trying to use UUCP to copy these files over to another computer and there was a 10K limit on file transfer sizes. Maybe you were doing FTP by email, and the system couldn't handle files larger than 5K.
Anyway, I brought it up because it's probably an easier solution in your case than writing a Perl script. I am a big writer of Perl, and many times Perl can handle a task better and faster than shell scripts can. However, in this case, this is an easy task to handle in shell.
Here's a pure Perl solution. The problem is that you want to create files after every 100 lines.
To solve this, I have two loops. One is an infinite loop, and the other loops 100 times. Before I enter the inner loop, I create a file for writing, and write one word per line. When that inner loop ends, I close the file, increment my $output_file_num and then open another file for output.
A few changes:
I use use warnings; and use strict (which is included when you specify that you want Perl version 5.12.0 or greater).
Don't use use vars;. This is obsolete. If you have to use package variables, declare the variable with our instead of my. When should you use package variables? If you have to ask that question, you probably don't need package variables. 99.999% of the time, simply use my to declare a variable.
I use constant to define your word file. This makes it easy to move the file when needed.
My s/../../ not only removes beginning and ending spaces, but also lowercases my word for me. The ^\s*(.*?)\s*$ matches the entire line but captures the word without the spaces at its beginning and end. The .*? is like .*, but is non-greedy: it will match the minimum possible (which in this case does not include the spaces at the end of the word). A small sketch of this substitution is shown after this list.
Note I define a label INPUT_WORD_LIST. I use this to force my inner last to exit the outer loop.
I take advantage of the fact that $output_word_list_fh is defined only in the loop. Once I leave the loop, the file is automatically closed for me since $output_word_list_fh is out of scope.
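Here is a tiny standalone sketch of that substitution on a sample word (the word itself is made up for illustration):
my $word = "  FooBAR  \n";
chomp $word;                         # remove the newline first
$word =~ s/^\s*(.*?)\s*$/\L$1\E/;    # trim surrounding spaces and lowercase the capture
print "[$word]\n";                   # prints "[foobar]"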
And the program:
#!/usr/bin/env perl

use 5.12.0;
use warnings;
use autodie;

use constant WORD_FILE => "/home/username/public/wordlists/Big-File-Of-Words.txt";

open my $input_word_list_fh, "<", WORD_FILE;

my $output_file_num = 0;

INPUT_WORD_LIST:
for (;;) {
    open my $output_word_list_fh, ">", sprintf "%05d.txt", $output_file_num;
    for my $line (1..100) {
        my $word;
        if ( not $word = <$input_word_list_fh> ) {
            last INPUT_WORD_LIST;
        }
        chomp $word;
        $word =~ s/^\s*(.*?)\s*$/\L$1\E/;
        say {$output_word_list_fh} "$word";
    }
    close $output_word_list_fh;
    $output_file_num += 1;
}
close $input_word_list_fh;
