I have a huge file with multiple lines and columns. Each line has many columns and many lines have the same name in the same position. E.g.
A C Z Y X
A C E J
B E K L M
What is the best way to find all lines that share the same item in a given position? For instance, I would like to know that there are 2 A, 2 C, 1 Z, etc., all ordered by column.
I am really new to Perl, so I am struggling to make progress with this; any tips are appreciated.
I got to this point:
#!/usr/local/bin/perl -w
use strict;
my $path='My:\Path\To\My\File.txt';
my $columns;
my $line;
open (FILE, $path) or die "Error opening: $!";
print "Opened!\n";
while (<FILE>)
{
my @line = split('\t', $_);
}
close FILE;
The output can be another TSV that examines the file only up to the 5th column, ordered from top to bottom, like:
A 2
C 2
Z 1
Y 1
E 1
J 1
B 1
E 1
K 1
L 1
Note that the first items appear first and, when shared among lines, do not show again for subsequent lines.
Edit: as per the questions in the comments, I changed the dataset and output. Note that two E appear: one belonging to the third column, the other belonging to the second column.
Edit2: Alternatively, this could also be analyzed column by column, thus showing the results in the first column, then in the second, and so on, as long as they were clearly separated. Something like
"1st" "col"
A 2
B 1
"2nd" "col"
C 2
E 1
"3rd" "col"
Z 1
E 1
K 1
"4th" "col"
Y 1
J 1
L 1
I did not fully understand the formatting of your desired output, so the script below outputs all the data from the first column on the first row, and so on. It can easily be modified to the format you desire, but it is a quick starting point for accumulating the data first and then processing it.
use strict;
use warnings;
use autodie;
my $path='My:\Path\To\My\File.txt';
open my $fh, '<', $path;
my @data;
# while (<$fh>) { Switch these lines when ready for real data
while (<DATA>) {
my @row = split ' ';
for my $col (0..$#row) {
$data[$col]{$row[$col]}++;
}
}
for my $coldata (@data) {
for my $letter (sort keys %$coldata) {
print "$letter $coldata->{$letter} ";
}
print "\n";
}
close $fh;
__DATA__
A C Z Y X
A C D J
B E K L M
Outputs
A 2 B 1
C 2 E 1
D 1 K 1 Z 1
J 1 L 1 Y 1
M 1 X 1
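If you want the first-appearance ordering shown in the question instead of alphabetical order, the accumulation step can also track the order in which each value is first seen per column. A minimal sketch, with the sample rows inlined for illustration (a real script would read them from the file):

```perl
use strict;
use warnings;

# Sample rows from the question, inlined for illustration.
my @rows = ("A C Z Y X", "A C E J", "B E K L M");

my (@counts, @order);   # per-column: value => count, and first-seen value order
for my $line (@rows) {
    my @cols = split ' ', $line;
    for my $i (0 .. $#cols) {
        # Remember the value the first time it shows up in this column.
        push @{ $order[$i] }, $cols[$i] unless $counts[$i]{ $cols[$i] };
        $counts[$i]{ $cols[$i] }++;
    }
}

for my $i (0 .. $#counts) {
    print "Col ", $i + 1, "\n";
    print "$_ $counts[$i]{$_}\n" for @{ $order[$i] };
}
```

This prints, for example, `Z 1`, `E 1`, `K 1` for the third column, in the order those values first appeared.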
Perhaps the following will be helpful:
use strict;
use warnings;
my $path = 'My:\Path\To\My\File.txt';
my %hash;
open my $fh, '<', $path or die $!;
while (<$fh>) {
my @cols = split ' ', $_, 5;
$hash{$_}{ $cols[$_] || '' }++ for 0 .. 3;
}
close $fh;
for my $key ( sort { $a <=> $b } keys %hash ) {
print "Col ", $key + 1, "\n";
print "$_ $hash{$key}{$_}\n"
for sort { $hash{$key}->{$b} <=> $hash{$key}->{$a} } grep $_,
keys %{ $hash{$key} };
}
Output on your dataset:
Col 1
A 2
B 1
Col 2
C 2
E 1
Col 3
Z 1
K 1
E 1
Col 4
J 1
L 1
Y 1
Related
I have the following code which reads in a 6x6 array from STDIN and saves it as an array of anonymous arrays. I am trying to print out each element with $arr[i][j], but the code below isn't working. It just prints out the first element over and over. How am I not accessing the element correctly?
#!/usr/bin/perl
my $arr_i = 0;
my @arr = ();
while ($arr_i < 6){
my $arr_temp = <STDIN>;
my @arr_t = split / /, $arr_temp;
chomp @arr_t;
push @arr, \@arr_t;
$arr_i++;
}
foreach my $i (0..5){
foreach my $j (0..5){
print $arr[i][j] . "\n";
}
}
i and j are not the same as the variables you declared in the foreach lines. Change:
print $arr[i][j] . "\n";
to:
print $arr[$i][$j] . "\n";
warnings alerted me to this issue. You should add these lines to all your Perl code:
use warnings;
use strict;
To demonstrate the Perlish mantra that there's "more than one way to do it":
use 5.10.0; # so can use "say"
use strict;
use warnings qw(all);
sub get_data {
my ($cols, $rows) = @_;
my ($line, @rows);
my $i;
for ($i = 1; $i <= $rows and $line = <DATA>; $i++) {
chomp $line;
my $cells = [ split ' ', $line ];
die "Row $i had ", scalar(@$cells), " instead of $cols" if @$cells != $cols;
push @rows, $cells;
}
die "Not enough rows, got ", $i - 1, "\n" if $i != $rows + 1;
\@rows;
}
sub print_data {
my ($cols, $rows, $data) = @_;
for (my $i = 0; $i < $rows; $i++) {
for (my $j = 0; $j < $cols; $j++) {
say $data->[$i][$j];
}
}
}
my $data = get_data(6, 6);
print_data(6, 6, $data);
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
if we use say, that avoids unsightly print ..., "\n"
get_data is a function that can be called and/or reused, instead of just being part of the main script
get_data knows what data-shape it expects and throws an error if it doesn't get it
[ ... ] creates an anonymous array and returns a reference to it
get_data returns an array-reference so data isn't copied
print_data is a function too
both functions use a conventional for loop instead of making lists of numbers, which in Perl 5 needs to allocate memory
There is also a two-line version of the program (with surrounding bits, and test data):
use 5.10.0; # so can use "say"
my @lines = map { [ split ' ', <DATA> ] } (1..6);
map { say join ' ', map qq{"$_"}, @$_ } @lines;
__DATA__
1 2 3 4 5 6
a b c d e f
6 5 4 3 2 1
f e d c b a
A B C D E F
7 8 9 10 11 12
Explanation:
using map is the premier way to iterate over lists of things where you don't need to know how many you've seen (otherwise, a for loop is needed)
the adding of " around the cell contents is only to prove they've been processed. Otherwise the second line could just be: map { say join ' ', @$_ } @lines;
I'm trying to write a script that converts every allele (A,T,G, or C) in my file to 0 or 1 depending on its ancestral state at that position, which is saved in another file "DAF.txt"
I have two files. They are ordered based on genomic position but contain different information.
File 1: Alleles
1 2 3 4 ...900000
A G T C G
G A G G C
A A T C C
DAF.txt: Ancestral status
1 A
2 A
3 T
4 G
...900000 C
DAF.txt serves as a reference of sorts for file 1: every row of file 1 must be compared, column by column, against DAF.txt.
If the letter in column 1, row 1 of file 1 equals the letter in row 1 of DAF.txt, then I need to print "0" to a new file (file.hap) in its place; otherwise, if the letters at that position don't match, print "1" to file.hap. The order matters: file.hap must be in the same order as file 1.
In the end, file.hap should look like:
1 2 3 4 ...900000
0 1 0 1 1
1 0 1 0 0
0 0 0 1 0
Any suggestions for doing this in perl? It's a big file...
If you have enough memory, you can store the ancestral alleles in an array:
#!/usr/bin/perl
use warnings;
use strict;
open my $DAF, '<', 'DAF.txt' or die $!;
open my $AL, '<', 'alleles' or die $!;
my @ancestral;
while (<$DAF>) {
chomp;
push @ancestral, $_;
}
<$AL>; # Skip the header.
while (my $al_line = <$AL>) {
my @alleles = split ' ', $al_line;
for my $i (0 .. $#alleles) {
print $alleles[$i] eq $ancestral[$i] ? 0 : 1;
print ' ' unless $i == $#alleles;
}
print "\n";
}
I would like to substitute each element in an array with its corresponding hash value. To make it clearer: I have two files, 1) ref.tab and 2) data.tab.
The reference file contains data like:
A a
B b
C c
D d
The data file contains data like:
1 apple red A
2 orange orange B
3 grapes black C
4 kiwi green D
What I would like to do now using Perl is: Substitute all instances of values in column 4 of data.tab with the corresponding values from ref.tab.
My code is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# Define file containing the reference values:
open DFILE, 'ref.tab' or die "Cannot open data file";
# Store each column to an array:
my @caps;
my @small;
while(<DFILE>) {
my @tmp = split /\t/;
push @caps, $tmp[0];
push @small, $tmp[1];
}
print join(' ', @caps),"\n";
print join(' ', @small),"\n";
# convert individual arrays to hashes:
my %replaceid;
@replaceid{@caps} = @small;
print "$_ $replaceid{$_}\n" for (keys %replaceid);
# Define the file in which column values are to be replaced:
open SFILE,'output.tab' or die "Cannot open source file";
# Store the required columns in an array:
my @col4;
while(<SFILE>) {
my @tmp1 = split /\t/;
push @col4, $tmp1[4];
}
for $_ (0..$#col4) {
if ($_ = keys $replaceid[$col4[$_]]){
~s/$_/values $replaceid[$col4[$_]]/g;
}
}
print "@col4";
close (DFILE);
close (SFILE);
exit;
The above program results in this error:
Use of uninitialized value $tmp1[3] in join or string at replace.pl line 4.
What is the solution?
New issue:
I would now like to leave the field blank if there is no corresponding replacement. Any idea how this could be done? That is,
ref.tab
A a
B b
C c
D d
F f
data.tab:
1 apple red A
2 orange orange B
3 grapes black C
4 kiwi green D
5 melon yellow E
6 citron green F
Desired output:
1 apple red a
2 orange orange b
3 grapes black c
4 kiwi green d
5 melon yellow
6 citron green f
How can I do this?
New issue, 2
I have another issue now with the AWK solution. It does leave the field blank when there is no match, but I have additional columns after the 4th, so whenever no match is found, the value in the fifth column gets shifted into the fourth column.
1 apple red a sweet
2 orange orange b sour
3 grapes black c sweet
4 kiwi green d sweet
5 melon yellow sweet
6 citron green f sour
On line 5 you can see what happens: the value in the 5th column gets shifted into the 4th column when no replacement is found.
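One way to avoid that shift, sketched here in Perl rather than AWK (the in-memory %replaceid and @data below are hypothetical stand-ins for reading ref.tab and data.tab): replace the 4th field with the mapped value or an explicit empty string, and join with tabs, so the later columns keep their positions.

```perl
use strict;
use warnings;
use 5.010;   # for the // (defined-or) operator

# Hypothetical in-memory stand-ins for ref.tab and data.tab.
my %replaceid = (A => 'a', B => 'b', C => 'c', D => 'd', F => 'f');
my @data = (
    "1 apple red A sweet",
    "5 melon yellow E sweet",
);

my @out;
for my $line (@data) {
    my @f = split ' ', $line;
    # An empty string (not a dropped field) when there is no mapping,
    # so the 5th column stays in the 5th position of the tab-joined output.
    $f[3] = $replaceid{ $f[3] } // '';
    push @out, join("\t", @f);
}
print "$_\n" for @out;
```

Because the output is tab-separated, the "blank" 4th field survives as two adjacent tabs instead of collapsing.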
The value in the 4th column is $tmp1[3], not $tmp1[4].
use strict;
use warnings;
# Define file containing the reference values:
open my $DFILE, '<', 'ref.tab' or die $!;
my %replaceid;
while (<$DFILE>) {
my ($k, $v) = split;
$replaceid{$k} = $v;
}
close $DFILE;
# print "$_ $replaceid{$_}\n" for (keys %replaceid);
# Define the file in which column values are to be replaced:
open my $SFILE, "<", 'data.tab' or die $!;
local $" = "\t"; #"
while(<$SFILE>) {
my @tmp1 = split;
$tmp1[3] = $replaceid{ $tmp1[3] } // qq{"no '$tmp1[3]' key in %replaceid!"};
# tab-separated output of the @tmp1 array, thanks to the $" variable set above
print "@tmp1\n";
}
close $SFILE;
Perl solution:
use strict;
use warnings;
# Create your filehandles
open my $REF , '<', 'ref.tab' or die $!;
open my $DATA, '<', 'data.tab' or die $!;
my %replaceid;
# Initialize your hashmap from ref file
while (<$REF>) {
my ($k, $v) = split /\s+/;
$replaceid{$k} = $v;
}
# Read the data file
while(<$DATA>) {
my @tmp = split /\s+/;
next unless exists $replaceid{$tmp[3]}; # Skip line if 4th field is not in the hash
$tmp[3] = $replaceid{$tmp[3]};          # Replace the 4th field with the hash value
print join("\t", @tmp), "\n";           # Print the current line
}
close $REF;
close $DATA;
AWK solution:
awk 'NR==FNR{a[$1]=$2;next}{$4=(a[$4])?a[$4]:""}1' OFS="\t" ref.tab data.tab
We read the ref.tab file completely and load it in a hash having column 1 as key and column 2 as value.
Once the ref.tab file is read, we move to data.tab file and substitute the 4th column with hash value.
I am back with another question. I have a list of data:
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
And I would like to compare the 3rd and 5th elements of each row, then group the rows that have the same 3rd and 5th elements.
For example, with the data above, the results will be :
3: 3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
9: 9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
FYI, in the actual data, the 3rd, 5th, and 7th elements are very long; I have truncated them here so the whole lines fit.
This is what I have done. I know it is very clumsy, but as a beginner I am doing my best.
The problem is that it shows only the first group of matching lines.
Could you show me where it went wrong, and/or other, prettier methods to solve this, please?
my $file = <>;
open(IN, $file)|| die "no $file: $!\n";
my @arr;
while (my $line=<IN>){
push @arr, [ split(/\s+/, $line) ];
}
close IN;
my (@temp1, @temp2, %hash1);
for (my $i=0;$i<=$#arr ;$i++) {
push @temp1, [$arr[$i][2], $arr[$i][4]];
for (my $j=$i+1;$j<=$#arr ;$j++) {
push @temp2, [$arr[$j][2], $arr[$j][4]];
if (($temp1[$i][0] eq $temp2[$j][0])&& ($temp1[$i][1] eq $temp2[$j][1])) {
push @{$hash1{$arr[$i][0]}}, $arr[$i], $arr[$j];
}
}
}
print Dumper \%hash1;
You appear to have overcomplicated this a bit more than it needs to be, but that's common for beginners. Think more about how you would do this manually:
Look at each line.
See whether the third and fifth fields are the same as the previous line.
If so, print them.
The looping and all that is completely unnecessary:
#!/usr/bin/env perl
use strict;
use warnings;
my ($previous_row, $third, $fifth) = ('') x 3;
while (<DATA>) {
my @fields = split;
if ($fields[2] eq $third && $fields[4] eq $fifth) {
print $previous_row if $previous_row;
print "\t$_";
$previous_row = '';
} else {
$previous_row = $fields[0] . "\t" . $_;
$third = $fields[2];
$fifth = $fields[4];
}
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
(Note that I changed line 10 slightly so that its third field will match line 9 in order to get the same groups in the output as specified.)
Edit: One line of code was duplicated by a copy/paste error.
Edit 2: In response to comments, here's a second version which doesn't assume that the lines which should be grouped are contiguous:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
push @lines, [ $_, split ];
}
# Sort @lines based on third and fifth fields (alphabetically), then on
# first field/line number (numerically) when third and fifth fields match
@lines = sort {
$a->[3] cmp $b->[3] || $a->[5] cmp $b->[5] || $a->[1] <=> $b->[1]
} @lines;
my ($previous_row, $third, $fifth) = ('') x 3;
for (@lines) {
if ($_->[3] eq $third && $_->[5] eq $fifth) {
print $previous_row if $previous_row;
print "\t$_->[0]";
$previous_row = '';
} else {
$previous_row = $_->[1] . "\t" . $_->[0];
$third = $_->[3];
$fifth = $_->[5];
}
}
__DATA__
1 L DIELTQSPE H EVQLQESDAELVKPGASVKISCKASGYTFTDHE
3 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAN
2 L DIVLTQSPRVT H EVQLQQSGAELVKPGASIKDTY
5 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C LELDKWASL
7 L SYELTQPPSVSVSPGSIT H QVQLVQSAKGSGYSFS P YNKRKAFYTTKNIIG
6 L DIQMTQIPSSLSASLSIC H EVQLQQSGVEVKMSCKASGYTFTS
9 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
8 L SYELTQPPSVSVSPGRIT H EVQLVQSGAASGYSFS P NNTRKAFYATGDIIG
11 L DVVMTQTPLQ H EVKLDESVTVTSSTWPSQSITCNVAHPASSTKVDKKIE
10 A MPIMGSSVAVLAIL B DIVMTQSPTVTI C EVQLQQSGRGP
12 A DIVMTQSPDAQYYSTPYSFGQGTKLEIKR
4 A ALQLTQSPSSLSAS B RITLKESGPPLVKPTCS C ELDKWAG
Slightly different approach:
#!/usr/bin/perl
use strict;
use warnings;
my %lines; # hash with 3rd and 5th elements as key
my %first_line_per_group; # stores in which line a group appeared first
while(my $line = <>) {
# remove line break
chomp $line;
# retrieve elements from the line
my @elements = split /\s+/, $line;
# ignore invalid lines
next if @elements < 5;
# build key from elements 3 and 5 (array 0-based!)
my $key = $elements[2] . " " . $elements[4];
if(! $lines{$key}) {
$first_line_per_group{$key} = $elements[0];
}
push @{ $lines{$key} }, $line;
}
# output
for my $key (keys %lines) {
print $first_line_per_group{$key} . ":\n";
print " $_\n" for @{ $lines{$key} };
}
Example:
use strict;
use warnings;
use Data::Dumper;
{ ... }
open my $fh, '<', $file or die "can't open $file: $!";
my %hash;
# read and save it
while(my $line = <$fh>){
my @line = split /\s+/, $line;
my $key = $line[2] . ' ' . $line[4];
$hash{$key} ||= [];
push @{$hash{$key}}, $line;
}
# remove single elements
for my $key (keys %hash){
delete $hash{$key} if @{$hash{$key}} < 2;
}
print Dumper \%hash;
Your approach shows a pretty solid grasp of Perl idiom and has merit, but still is not how I would do it.
I think that you will have an easier time with this if you structure your data slightly differently: Let %hash1 be something like
(
'ALQLTQSPSSLSAS' => {
'RITLKESGPPLVKPTCS' => [3, 4, 5],
'ABCXYZ' => [93, 95, 96],
},
'MPIMGSSVAVLAIL' => {
'DIVMTQSPTVTI' => [9, 10],
},
)
where I have added a datum ABCXYZ which is not in your example to show the data structure in its fullness.
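A sketch of how such a nested %hash1 could be built from the parsed rows. The rows are inlined here for illustration (in the real script they would come from the file read shown below); keying on the 3rd and 5th fields (indices 2 and 4) and storing the line numbers gives exactly the structure above:

```perl
use strict;
use warnings;

# A few parsed rows, inlined for illustration; each is
# [line_number, field2, field3, field4, field5, field6, field7].
my @rows = (
    [3, 'A', 'ALQLTQSPSSLSAS', 'B', 'RITLKESGPPLVKPTCS', 'C', 'ELDKWAN'],
    [4, 'A', 'ALQLTQSPSSLSAS', 'B', 'RITLKESGPPLVKPTCS', 'C', 'ELDKWAG'],
    [9, 'A', 'MPIMGSSVAVLAIL', 'B', 'DIVMTQSPTVTI',      'C', 'EVQLQQSGRGP'],
);

my %hash1;
for my $row (@rows) {
    # Two-level key: 3rd field, then 5th field; value: list of line numbers.
    push @{ $hash1{ $row->[2] }{ $row->[4] } }, $row->[0];
}
```

Groups are then simply the inner array refs with more than one line number.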
You should be using the 3-argument form of open() and you can simplify reading in the data:
open my $fh, '<', $file
or die "Cannot open '$file': $!\n";
chomp(my @rows = <$fh>);
@rows = map {[split]} @rows;
close $fh;
To group the rows, you can use a hash with the 3rd and 5th fields concatenated as the keys. Edit: You have to add a separation character to eliminate invalid results "if different lines produce the same concatenation" (Qtax). Additional data, for example, the number of the individual data rows, can be stored as the hash value. Here, the row's fields are stored:
my %groups;
for (@rows) {
push @{ $groups{$_->[2] . ' ' . $_->[4]} }, $_
if @$_ >= 5;
}
Sort out single elements:
@{ $groups{$_} } < 2 && delete $groups{$_}
for keys %groups;
I am currently trying to pass a 32 by 48 matrix file to a multi-dimensional array in Perl. I am able to access all of the values, but I am having issues accessing a specific value.
Here is a link to the data set:
http://paste-it.net/public/x1d5301/
Here is what I have for code right now.
#!/usr/bin/perl
open FILE, "testset.txt" or die $!;
my @lines = <FILE>;
my $size = scalar @lines;
my @matrix = (1 .. 32);
my $i = 0;
my $j = 0;
my @micro;
foreach ($matrix)
{
foreach ($lines)
{
push @{$micro[$matrix]}, $lines;
}
}
It doesn't seem you understand that $matrix only refers to @matrix when it is immediately followed by an array subscript: [ $slot ]. Otherwise, $matrix is a completely different variable from @matrix (and both are different from %matrix as well). See perldata.
#!/usr/bin/perl
use English;
Don't! use English--that way!
This brings in $MATCH, $PREMATCH, and $POSTMATCH and incurs the dreaded $&, $`, $' penalty. You should wait until you're using an English variable and then just import that.
open FILE, "testset.txt" or die $!;
Two things: 1) use lexical file handles, and 2) use the three-argument open.
my #lines = <FILE>;
As long as I'm picking: Don't slurp big files. (Not the case here, but it's a good warning.)
my $size = scalar #lines;
my #matrix = (1 .. 32);
my $i = 0;
my $j = 0;
my #micro;
I see we're at the "PROFIT!!" stage here...
foreach ($matrix) {
You don't have a variable $matrix; you have a variable @matrix.
foreach ($lines) {
The same thing is true with $lines.
push @{ $micro[$matrix] }, $lines;
}
}
Rewrite:
use strict;
use warnings;
use English qw<$OS_ERROR>; # $!
open( my $input, '<', 'testset.txt' ) or die $OS_ERROR;
# I'm going to assume space-delimited, since you don't show
my @matrix;
# while ( defined( $_ = <$input> ))...
while ( <$input> ) {
chomp; # strip off the record separator
# Load each slot of #matrix with a reference to an array filled with
# the line split by spaces.
push @matrix, [ split ]; # split = split( ' ', $_ )
}
If you are going to be doing quite a bit of math, you might consider PDL (the Perl Data Language). You can easily set up your matrix and perform operations on it:
use 5.010;
use PDL;
use PDL::Matrix;
my @rows;
while( <DATA> ) {
chomp;
my @row = split /\s+/;
push @rows, \@row;
}
my $a = PDL::Matrix->pdl( \@rows );
say "Start ", $a;
$a->index2d( 1, 2 ) .= 999;
say "(1,2) to 999 ", $a;
$a++;
say "Increment all ", $a;
__DATA__
1 2 3
4 5 6
7 8 9
2 3 4
The output shows the matrix evolution:
Start
[
[1 2 3]
[4 5 6]
[7 8 9]
[2 3 4]
]
(1,2) to 999
[
[ 1 2 3]
[ 4 5 999]
[ 7 8 9]
[ 2 3 4]
]
Increment all
[
[ 2 3 4]
[ 5 6 1000]
[ 8 9 10]
[ 3 4 5]
]
PDL gives you quite a bit of power to run arbitrary, complex operations on every element of the matrix, just as I added 1 to every element here. You skip the looping acrobatics entirely.
Not only that, PDL does a lot of special stuff to make math really fast and to have a low memory footprint. Some of the stuff you want to do may already be implemented.
You probably need to chomp the values:
chomp( my @lines = <FILE> );
To clarify a tangential point to Axeman's answer:
See perldoc -f split:
A split on /\s+/ is like a split(' ') except that any leading whitespace produces a null first field. A split with no arguments really does a split(' ', $_) internally.
#!/usr/bin/perl
use YAML;
$_ = "\t1 2\n3\f4\r5\n";
print Dump { 'split' => [ split ] },
{ "split ' '" => [ split ' ' ] },
{ 'split /\s+/' => [ split /\s+/ ] }
;
Output:
---
split:
- 1
- 2
- 3
- 4
- 5
---
split ' ':
- 1
- 2
- 3
- 4
- 5
---
split /\s+/:
- ''
- 1
- 2
- 3
- 4
- 5
I see the question is pretty old, but as the author has just edited the question, perhaps this is still of interest. Also the link to the data is dead, but since other answers use space as the separator, I will too.
This answer demonstrates Tie::Array::CSV which allows random access to a CSV (or other file parsable with Text::CSV).
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::Array::CSV;
## put DATA into temporary file
## if not using DATA, put file name in $file
use File::Temp ();
my $file = File::Temp->new();
print $file <DATA>;
##
tie my @data, 'Tie::Array::CSV', $file, {
text_csv => {
sep_char => " ",
},
};
print $data[1][2];
__DATA__
1 2 3 4 5
6 7 8 9 1
2 3 4 5 6