Perl: Inserting values into specific columns of CSV file - arrays

I have CSV data of the form:
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
...
I need to reproduce the "label" to the left of "customer" columns marked with Y - and have nothing ("") to the left of columns marked with N.
Expected output:
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3,Customer3...
1,label1,label1,Y,"",N,label1,Y
2,label2,"",N,label2,Y,"",N
When opened using Excel, it would look like this:
S.No Label Customer1 Customer1 Customer2 Customer2 Customer3 Customer3...
1 label1 label1 Y N label1 Y
2 label2 N label2 Y N
The two leftmost columns, referring to S.No and the original "Label" column, are constant.
What is the simplest way to do this? I tried the following code:
use strict;
use warnings;
my $nonIncludesFile = "nonIncludes.csv";
open(my $xfh, "+>", $nonIncludesFile) or warn "Unable to open $nonIncludesFile, $!";
chomp( my $header = <$xfh> );
my @names = split ",", $header;
my @names1;
my @fields;
my @fields1;
for(my $j=0; $j< scalar(@names); $j++)
{
    $names1[$j] = $names[$j];
}
while(<$xfh>)
{
    my $nonIncLine = $_;
    $nonIncLine = chomp($nonIncLine);
    @fields = split ",", $nonIncLine;
    next if $. == 1; #skip the first line
    for(my $i = 0; $i < scalar(@fields) -2; $i++) #Number of "customers" = scalar(@fields) -2
    {
        $fields1[0] = $fields[0];
        $fields1[1] = $fields[1];
        if('Y' eq $fields[ $i + 2 ])
        {
            $fields1[$i+2] = 'Y';
            substr(@fields1, $i + 1, 0, $fields[1]); #insert the label to the left - HERE
        }
        else
        {
            $fields1[$i+2] = 'N';
            substr(@fields1, $i + 1, 0, "");
        }
    }
}
print $xfh @names1;
print $xfh @fields1;
close($xfh);
This however complains of "substr outside of string" at the line marked by "HERE".
What am I doing wrong? And is there any simpler (and better) way to do this?

Something like this maybe?
#!/usr/bin/perl
use strict;
use warnings;
#read the header row
chomp( my ( $sn, $label, @customers ) = split( /,/, <DATA> ) );
#double the 'customers' column headings (one is suffixed "_label")
print join( ",", $sn, $label, map { $_ . "_label", $_ } @customers ), "\n";
#iterate data
while (<DATA>) {
    #strip trailing linefeed
    chomp;
    #extract fields with split - note breaks if you've quoted commas inline.
    my ( $sn, $label, @row ) = split /,/;
    #for each Y/N value, emit the label before "Y" and a blank before anything else.
    my @out = ( $sn, $label );
    foreach my $value (@row) {
        push @out, ( $value eq "Y" ? $label : "" ), $value;
    }
    #join once so the row has no trailing comma
    print join( ",", @out ), "\n";
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3
1,label1,Y,N,Y
2,label2,N,Y,N
Assumes you don't have any fruity special characters (e.g. commas) in the fields, because it'll break if you do, and you might want to consider Text::CSV instead.
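To make that caveat concrete, here is a small sketch using a made-up line whose label contains a quoted comma (the line is an assumption, not taken from the question's data). It compares a plain split with parse_line from Text::ParseWords, which ships with Perl. For real CSV files Text::CSV is still the more robust choice.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical input line: the second field contains a comma inside quotes.
my $line = '1,"label, one",Y,N';

# A naive split cuts the quoted field in two.
my @naive = split /,/, $line;

# parse_line(delimiter, keep_quotes, line) respects the quoting.
my @parsed = parse_line( ',', 0, $line );

print scalar(@naive),  " fields from split\n";
print scalar(@parsed), " fields from parse_line\n";
```

With this input, split reports one field more than the line really has, while parse_line keeps "label, one" together.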

It is always much better to post some usable test data than to write something like this question.
However, it looks like your data has no quoted fields or escaped characters, so you can just use split and join to process the CSV data.
Here's a sample Perl program that fulfils your requirement. The example output uses your data as it is. Each line of data has to be processed backwards so that the insertions don't affect the indices of elements that are yet to be processed.
use strict;
use warnings 'all';
use feature 'say';
while ( <DATA> ) {
    chomp;
    my @fields = split /,/;
    for ( my $i = $#fields; $i > 1; --$i ) {
        my $newval =
            $. == 1 ? $fields[$i] :
            lc $fields[$i] eq 'y' ? $fields[1] :
            '';
        splice @fields, $i, 0, $newval;
    }
    say join ',', @fields;
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
output
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3...,Customer3...
1,label1,label1,Y,,N,label1,Y
2,label2,,N,label2,Y,,N
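If the index bookkeeping of in-place insertion is not wanted, the same rows can be built front to back by emitting a pair for every flag. The following is only a sketch of that alternative. The helper name expand_row is made up for illustration, and it assumes flags are exactly "Y" or something else.

```perl
use strict;
use warnings;

# expand_row is a hypothetical helper name, not part of either answer.
# It keeps the first two fields and precedes each remaining flag with
# the label (for "Y") or an empty string (for anything else).
sub expand_row {
    my ( $sn, $label, @flags ) = @_;
    return ( $sn, $label, map { ( $_ eq 'Y' ? $label : '' ), $_ } @flags );
}

print join( ',', expand_row( split /,/, '1,label1,Y,N,Y' ) ), "\n";
print join( ',', expand_row( split /,/, '2,label2,N,Y,N' ) ), "\n";
```

Because a new list is returned rather than the old one being modified, the order of processing no longer matters.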

Related

Identify items in hash with matching and non-matching criteria

I have two tab-delimited files:
one is a reference with thousands of entries
and the other is a list of millions of criteria
that are used to search the reference.
I make a hash of the reference file with the following code
use strict;
use warnings;
#use Data::Dumper;
#use Timer::Runtime;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "file";
my %transcripts;
{
    open(my $transcripts_fh, "<", $transcripts_qfn)
        or die("Can't open \"$transcripts_qfn\": $!\n");
    while ( <$transcripts_fh> ) {
        chomp;
        my @refs = split(/\t/, $_);
        my ($ref_chr, $ref_strand) = @refs[0, 6];
        my $values = {
            start => $refs[3],
            end => $refs[4],
            info => $refs[8]
        };
        #print Data::Dumper->Dump([$values]), $/; #confirm structure is fine
        push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
    }
}
Then I open the other input file, define the elements, and parse the hash to find matching criteria
while ( <$in_fh> ) {
    chomp;
    my ($x, $strand, $chr, $y, $z) = split(/\t/, $_);
    #match the reference hash for things equal to $chr and $strand
    my $transcripts_array = $transcripts{$chr}{$strand};
    for my $transcript ( @$transcripts_array ) {
        my $start = $transcript->{start};
        my $end = $transcript->{end};
        my $info = $transcript->{info};
        #print $info and other criteria from if statements to outfile, this code works
    }
}
This works, but I would like to know if I can then find elements in the hash that match $chr but not $strand (which has a binary value of either sign).
I put the following into the same while block after the previous for, but it does not appear to work
my $transcripts_opposite_strand = $transcripts{$chr}{!$strand};
for my $transcript (@$transcripts_opposite_strand) {
    my $start = $transcript->{start};
    my $end = $transcript->{end};
    my $info = $transcript->{info};
    #print $info and other criteria from if statements
}
I apologize for the code snippets; I tried to keep the relevant information. Because of the size of the files I can't really brute force it by going line by line by line.
The negation operator ! enforces boolean context on its argument. "+" and "-" are both true in boolean context, so ! $strand is always false, i.e. "" in string context.
Either store a boolean value in the hash:
$strand = $strand eq '+';
or don't use boolean negation:
my $transcripts_opposite_strand = $transcripts{$chr}{ $strand eq '+' ? '-' : '+' };
The ternary operator can be replaced by shorter but less readable alternatives, e.g.
qw( + - )[ $strand eq '+' ]
because in numeric context, true is interpreted as 1 and false as 0.
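A quick way to see the effect of ! on the strand symbols, followed by a lookup table as one more way to express the mapping. This is a sketch only. The names %OPPOSITE and opposite_strand are illustrative and do not appear in the question.

```perl
use strict;
use warnings;

# Both strand symbols are true, so "!" maps each of them to the empty string.
for my $strand ( '+', '-' ) {
    printf "!'%s' gives '%s'\n", $strand, !$strand;
}

# A lookup table makes the intended mapping explicit.
# %OPPOSITE and opposite_strand are illustrative names, not from the question.
my %OPPOSITE = ( '+' => '-', '-' => '+' );

sub opposite_strand {
    my ($strand) = @_;
    return $OPPOSITE{$strand};
}

print opposite_strand('+'), "\n";
print opposite_strand('-'), "\n";
```

With the helper in place, the lookup from the question could be written as $transcripts{$chr}{ opposite_strand($strand) }.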

loop through elements of array to find character perl

I have a perl array where I only want to loop through elements 2-8.
The elements are only meant to contain numbers, so if any of those elements contain a letter, I want to set an error flag = 1, as well as some other variables as seen.
The reason I have 2 error flag variables is due to scope rules within the loop.
fields is an array, I created by splitting another irrelevant array by the " " key.
So, when I try to print error_line2, error_fname2 from outside the loop, I get this:
Use of uninitialized value $error_flag2 in numeric eq (==)
I don't know why, because I've initialized the value within the loop and created the variable outside the loop.
Not sure if I'm even looping to find characters correctly, so then it's not setting the error_flag2 = 1.
Example line:
bob hankerman 2039 3232 23 232 645 64x3 324
since element 7 has the letter 'x' , I want the flag to be set to 1.
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
my $players_file = $ARGV[0];
my @players_array;
open (my $file, "<", "$players_file")
    or die "Failed to open file: $!\n";
while(<$file>) {
    chomp;
    push @players_array, $_;
}
close $file;
#print join "\n", @players_array;
my $num_of_players = @players_array;
my $error_flag;
my $error_line;
my $error_fname;
my $error_lname;
my $error_flag2=1;
my $error_line2;
my $error_fname2;
my $error_lname2;
my $i;
foreach my $player(@players_array){
    my @fields = split " ", $player;
    my $size2 = @fields;
    for($i=2; $i<9; $i++){
        print "$fields[$i] \n";
        if (grep $_ =~ /^[a-zA-Z]+$/){
            my $errorflag2 = 1;
            $error_flag2 = $errorflag2;
            my $errorline2 = $player +1;
            $error_line2 = $errorline2;
            my $errorfname2 = $fields[0];
            $error_fname2 = $errorfname2;
        }
    }
    if ($size2 == "9" ) {
        my $firstname = $fields[0];
        my $lastname = $fields[1];
        my $batting_average = ($fields[4]+$fields[5]+$fields[6]+$fields[7]) / $fields[3];
        my $slugging = ($fields[4]+($fields[5]*2)+($fields[6]*3)+($fields[7]*4)) / $fields[3];
        my $on_base_percent = ($fields[4]+$fields[5]+$fields[6]+$fields[7] +$fields[8]) / $fields[2];
        print "$firstname ";
        print "$lastname ";
        print "$batting_average ";
        print "$slugging ";
        print "$on_base_percent\n ";
    }
    else {
        my $errorflag = 1;
        $error_flag = $errorflag;
        my $errorline = $player +1;
        $error_line = $errorline;
        my $errorfname = $fields[0];
        $error_fname = $errorfname;
        my $errorlname = $fields[1];
        $error_lname = $errorlname;
    }
}
if ($error_flag == "1"){
    print "\n Line $error_line : ";
    print "$error_fname, ";
    print "$error_lname :";
    print "Line contains not enough data.\n";
}
if ($error_flag2 == "1"){
    print "\n Line $error_line2 : ";
    print "$error_fname2, ";
    print "Line contains bad data.\n";
}
OK, so the problem you've got here is that you're thinking of grep in Unix terms - a text based thing. It doesn't work like that in perl - it operates on a list.
Fortunately, this is pretty easy to handle in your case, because you can split your line into words.
Without your source data, this is hopefully a proof of concept:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
    #split the current line on whitespace into an array.
    #first two elements get assigned to firstname lastname, and then the rest
    #goes into @values
    my ( $firstname, $lastname, @values ) = split; #works on $_ implicitly.
    #check every element in @values, and test the regex 'non-digit' against it.
    my @errors = grep { /\D/ } @values;
    #output any matches e.g. things that contained 'non-digits' anywhere.
    print Dumper \@errors;
    #an array in a scalar context evaluates as the number of elements.
    #we need to use "scalar" here because print accepts list arguments.
    print "There were ", scalar @errors, " errors\n";
}
__DATA__
bob hankerman 2039 3232 23 232 645 64x3 324
Or reducing down the logic:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
    #note - we don't need to explicitly specify 'scalar' here,
    #because assigning it to a scalar does that automatically.
    #(split) splits the current line, and [2..8] skips the first two.
    my $count_of_errors = grep { /\D/ } (split)[2..8];
    print $count_of_errors;
}
__DATA__
bob hankerman 2039 3232 23 232 645 64x3 324
First: you don't need grep here. You can simply match the string with =~ in Perl, and print the matched text with $&.
Second: use $_ only when the loop has no other variable. Your loop already uses $i, so you can write it as:
for my $i (2..8) {
    print "$i\n";
}
or
foreach (2..8) {
    print "$_\n";
}
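Putting the two points together, a per-field check with =~ over indices 2 to 8 might look like the sketch below. The helper name find_bad_field is made up for illustration, and the choice to skip missing fields (leaving short lines to the separate "not enough data" check) is an assumption.

```perl
use strict;
use warnings;

# find_bad_field is a hypothetical helper: it returns the index of the first
# field in positions 2..8 that contains a non-digit, or undef if all are clean.
sub find_bad_field {
    my @fields = @_;
    for my $i ( 2 .. 8 ) {
        next unless defined $fields[$i];    # short lines are a separate error
        return $i if $fields[$i] =~ /\D/;
    }
    return;
}

my @fields = split ' ', 'bob hankerman 2039 3232 23 232 645 64x3 324';
my $bad    = find_bad_field(@fields);
print defined $bad ? "bad data in field $bad: $fields[$bad]\n" : "line is clean\n";
```

Returning the index rather than setting several flag variables keeps the error state in one place, and an undefined result means the flag simply stays unset.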

Add all values in array for each ID in Perl

I have this table:
NAME |12/31/2016|VALUE
AAA |1/31/2017 |10
AAA |2/1/2017 |20
AAA |2/2/2017 |30
AAA |2/3/2017 |40
AAA |2/4/2017 |50
NAME |2/9/2017 |VALUE
BBB |2/10/2017 |20
BBB |2/11/2017 |30
BBB |2/12/2017 |40
BBB |2/13/2017 |50
BBB |2/14/2017 |60
and this would be my desired output:
NAME |DATE |VALUE
AAA |12/31/2016 |150
AAA |1/31/2017 |140
AAA |2/1/2017 |120
NAME |DATE |VALUE
BBB |2/9/2017 |200
BBB |2/10/2017 |180
BBB |2/11/2017 |150
What I want to do is, for each of the valid symbols, (AAA, BBB) I want to have three rows.
For the first row of each column, I want all the values added,
For example, row 1 value for AAA:
10+20+30+40+50 = 150
then for row 2 I want to just add from the second value to the last.
For example row 2 value for AAA
20+30+40+50 = 140
and so on same goes for BBB.
I want to shift the dates down so that 12/31/2016 would match AAA, then get the first three dates for each row.
I currently have this code, but it doesn't do much; it just gives me a bunch of numbers.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Data::Dumper;
sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}
my %cashflow;
my %fields = (
    ID => 0,
    DATES => 1,
    VALUE => 2,
);
my @total;
my @IDs;
my @uniqueIDs;
my @dates;
my @add;
my $i = 0;
my @values;
my $counter = 3;
open( FILE, "try.CSV" );
while ( my $line = <FILE> ) {
    chomp( $line );
    my @lineVals = split( /\|/, $line );
    if ( $lineVals[ $fields{ID} ] !~ /^SYMBOL$/i ) {
        push @IDs, $lineVals[ $fields{ID} ];
    }
    @uniqueIDs = uniq( @IDs );
    #push all CASH FLOW AMOUNTS to @cashflow
    if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
        $lineVals[ $fields{VALUE} ] =~ s/\r//;
        push @total, $lineVals[ $fields{VALUE} ];
    }
    if ( $lineVals[ $fields{DATES} ] =~ /(\d{1,2})\/(\d{1,2})\/(\d{4})/ ) {
        $lineVals[ $fields{DATES} ] = sprintf( '%04d%02d%02d', $3, $2, $1 );
    }
    $cashflow{ uc $lineVals[ $fields{ID} ] }{DATES} = $lineVals[ $fields{DATES} ];
    $cashflow{ uc $lineVals[ $fields{ID} ] }{VALUE} = $lineVals[ $fields{VALUE} ];
    foreach my $ID ( @uniqueIDs ) {
        foreach my $symb ( keys %cashflow ) {
            if ( $ID = $symb ) {
                if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
                    $lineVals[ $fields{VALUE} ] =~ s/\r//;
                    push @total, $lineVals[ $fields{VALUE} ];
                    my $i = 0;
                    my $grand = 0;
                    foreach my $val ( @total ) {
                        while ( $i < $counter ) {
                            $grand += $val;
                            print "$grand \n";
                            $i++;
                        }
                        shift @total;
                    }
                }
            }
        }
    }
}
close FILE;
I'm really stuck with this. I don't know what to do with the problem.
A possible solution:
#!perl
use strict;
use warnings;
sub trim {
    my ($str) = @_;
    s!\A\s+!!, s!\s+\z!! for $str;
    $str
}
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
my ($group_name, @dates, @values);
my $sum = 0;
my $print_group = sub {
    return if !defined $group_name;
    my $format = " %-6s|%-11s|%s\n";
    printf $format, 'NAME', 'DATE', 'VALUE';
    for my $date (@dates) {
        printf $format, $group_name, $date, $sum;
        $sum -= shift @values if @values;
    }
};
while (my $line = readline $fh) {
    my ($name, $date, $value) = map trim($_), split /\|/, $line;
    if ($name eq 'NAME') {
        $print_group->();
        $group_name = undef;
        @dates = $date;
        @values = ();
        $sum = 0;
        next;
    }
    $group_name ||= $name;
    push @dates, $date if @dates < 3;
    push @values, $value if @values < 2;
    $sum += $value;
}
$print_group->();
Let's go over it.
sub trim {
    my ($str) = @_;
    s!\A\s+!!, s!\s+\z!! for $str;
    $str
}
A helper function for removing leading/trailing whitespace from a string. We're using ! as the s delimiter here because / breaks SO's syntax highlighting. Shrug.
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
Open our input file. Note: We use a lexical variable ($fh) instead of a bareword filehandle, and we use 3-argument open. This is strongly recommended. We also check open's return value and produce a nice error message in case of failure, including both the name of the file that couldn't be opened ($file) and the reason for failing ($!).
my ($group_name, @dates, @values);
my $sum = 0;
We set up some state variables that we want to preserve across loop iterations. $group_name is the name of the group we're currently processing, @dates holds the dates we've saved so far, and @values holds the values we've saved so far. $sum is a running sum of all the values in the current group, and it starts at 0.
my $print_group = sub {
    return if !defined $group_name;
    my $format = " %-6s|%-11s|%s\n";
    printf $format, 'NAME', 'DATE', 'VALUE';
    for my $date (@dates) {
        printf $format, $group_name, $date, $sum;
        $sum -= shift @values if @values;
    }
};
A helper function for printing the output for a single group. If $group_name isn't set, we haven't processed any input for the current group yet, so we do nothing and return. Otherwise we print a NAME | DATE | VALUE header, followed by a row of data for each element in @dates. For each $date we output the current group name (e.g. AAA), $date, and the sum of values (all nicely formatted using printf). Initially $sum is the sum of all group values, but after the first iteration we start subtracting the values from @values: If the list of values in the input was x1, x2, x3, x4, ..., then $sum is initially x1 + x2 + x3 + x4 + ..., and that's what's printed in the first line of output. After that we subtract x1, so the next line gets x1 + x2 + x3 + x4 + ... - x1, which is x2 + x3 + x4 + .... After that we subtract x2, so the third row of data gets x3 + x4 + ....
while (my $line = readline $fh) {
my ($name, $date, $value) = map trim($_), split /\|/, $line;
Our main loop. We read a line of input, split it on |, and trim each field.
if ($name eq 'NAME') {
    $print_group->();
    $group_name = undef;
    @dates = $date;
    @values = ();
    $sum = 0;
    next;
}
If $name is 'NAME', this is the start of a new group. Print the output for the current group if any ($print_group->() does nothing if there is no current group), then reset our state variables back to initial values, except for @dates, which is filled with the $date value from the header row. Then start the next iteration of the loop because we're done with this line.
$group_name ||= $name;
push @dates, $date if @dates < 3;
push @values, $value if @values < 2;
$sum += $value;
If we get here, this line is not the start of a new group. We set $group_name if it hasn't been set yet. We add $date to our list of saved dates (but we only need 3 dates, so do nothing if we already have 3). We add $value to our list of saved values (but we only need 2 of them). Finally we add $value to our total $sum within the group.
}
$print_group->();
At the end of the loop we've also just finished processing a group, so we need to call $print_group here as well.
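The subtract-as-you-go idea can be tried in isolation. The sketch below is not part of the program above: suffix_sums is an illustrative helper name, and the inputs are the AAA and BBB values from the question.

```perl
use strict;
use warnings;

# suffix_sums is an illustrative helper, not part of the answer's program.
# Given a row count and a list of values, it returns the total of all values,
# then the total without the first value, and so on, for $rows rows.
sub suffix_sums {
    my ( $rows, @values ) = @_;
    my $sum = 0;
    $sum += $_ for @values;
    my @out;
    for my $i ( 0 .. $rows - 1 ) {
        push @out, $sum;
        $sum -= $values[$i] if $i < @values;
    }
    return @out;
}

print join( ' ', suffix_sums( 3, 10, 20, 30, 40, 50 ) ), "\n";    # AAA values
print join( ' ', suffix_sums( 3, 20, 30, 40, 50, 60 ) ), "\n";    # BBB values
```

The two lines printed are 150 140 120 and 200 180 150, which match the VALUE column of the desired output.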
This will do as you ask. It reads the whole data file into an array of arrays and manipulates that array before printing it. The blocks are processed backwards from the end so that the other blocks remain in place when the trailing lines are deleted.
This program expects the path to the input file as a parameter on the command line and writes the result to STDOUT.
use strict;
use warnings 'all';
my @data = map [ /[^|\s]+/g ], <>;
# Make a list of the indices of all the header rows
my @headers = grep { $data[$_][0] eq 'NAME' } 0 .. $#data;
# Make a list of the indices of the first
# and last lines of all the data blocks
my @blocks = map {
    [
        $headers[$_] + 1,
        $_ == $#headers ? $#data : $headers[$_+1] - 1
    ]
} 0 .. $#headers;
# Shift the second column down
# Replace the col2 header with 'DATE'
#
$data[$_][1] = $data[$_-1][1] for reverse 1 .. $#data;
$data[$_][1] = 'DATE' for @headers;
# Edit each block of data
#
for my $block ( reverse @blocks ) {
    my ( $beg, $end ) = @$block;
    # Calculate the block total
    my $total = 0;
    for ( $beg .. $end ) {
        $total += $data[$_][2];
    }
    # Calculate the first three data values
    for my $i ( $beg .. $beg + 2 ) {
        my $next = $total - $data[$i][2];
        $data[$i][2] = $total;
        $total = $next;
    }
    # Remove everything except those three lines
    splice @data, $beg+3, $end-$beg-2;
}
print join('|', @$_), "\n" for @data;
output
NAME|DATE|VALUE
AAA|12/31/2016|150
AAA|1/31/2017|140
AAA|2/1/2017|120
NAME|DATE|VALUE
BBB|2/9/2017|200
BBB|2/10/2017|180
BBB|2/11/2017|150

Pulling out potentially overlapping subsets of elements in array to make smaller arrays

My input file looks like below (real one is much larger):
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
There are 13 lines with two elements on each line. The first element is a name. Each name either ends with a "tag", _mark or _impute, or has no tag at all. The point of the tag is to distinguish between types of names, which form the basis of my search for subsets within the entire list.
The subsets begin with a _mark name that immediately precedes an instance of an _impute name. The subsets end with the very next instance of _mark. All names in between, which will necessarily not have any such tag, also go into a subset, which I'd like to collect into an array and send off to a subroutine to process (details of that not important). Please note, the positions with IMP in the name are not the same as those actually tagged with a _impute.
For example, with the above, the first useable subset is:
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
The second useable subset is:
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
and so on... EDIT: Note that last _mark name of the first set is the first _mark name of the second.
My code for this:
#!/usr/bin/perl
use strict; use warnings;
my $usage = "usage: merge_impute.pl {genotype file} {distances file} \n";
die $usage unless @ARGV == 2;
my $genotypes = $ARGV[0];
open (FILE, "<$genotypes");
my @genotypes = <FILE>;
close FILE;
my $distances = $ARGV[1];
open (DISTS, "<$distances");
my @distances = <DISTS>;
close DISTS;
my @workingset = ();
#print scalar @distances;
for ( my $i = 0; $i < scalar @distances; $i++ ){
    chomp $distances[$i];
    #print "$distances[$i]\n";
    if ( $distances[$i] =~ m/impute/ ){
        push ( @workingset,$distances[$i-1],$distances[$i],$distances[$i+1]);
    }
    print "i=$i: @workingset\n";
    # at this point send off to sub routine
    @workingset=();
}
As you can see, the if loop is only set up to find subsets that contain only one _impute name. How can I modify the code so that a subset will "fill up" with as many names as required until we arrive at the next _mark name?
EDIT: Perhaps instead of the for loop, I could do something like...
push (@workingset, $distances[0], $distances[1] );
until ( $distance[ ??? ] =~ m/_mark/ ){
    push ( @workingset, $distance[ ??? ] );
}
But what could $distances[ ??? ] be?
EDIT: Or an alternative for loop...
push (@workingset, $distances[0] );
for ( my $i = 1; $i < scalar @distances - 1 ; $i++ ){
    until ( $distances[ $i ] =~ m/_mark/ ){
        push ( @workingset, $distances[ $i ] );
        # send @workingset to sub routine
        #clear workingset
        @workingset = ();
    }
}
Though this isn't working.
I also tried...
push (@workingset, $distances[0] );
for ( my $i = 1; $i < scalar @distances - 1 ; $i++ ){
    until ( $distances[ $i ] =~ m/_mark/ ){
        push ( @workingset, $distances[ $i ] );
        next if $distances[ $i+1 ] !~ /_mark/;
    }
    # send @workingset to sub routine here
    print "i=$i, @workingset\n\n";
    #clear workingset
    @workingset = ();
}
I don't have a lot of time right now but I'll hopefully have some time in the morning to check back. Here's a quick example on how you could do it (it is meant to be simple and easy to understand, not fancy). Hopefully it helps you get on the right track for parsing the data.
use strict;
use warnings;
my $first_mark;
my @workingset = ();
my $second_mark;
while (<DATA>){
    chomp;
    if ( /_mark/ and scalar @workingset == 0 ) {
        $first_mark = $_;
    } elsif ( /IMP|_impute/ and defined $first_mark) {
        push @workingset, $_;
    } elsif ( /_mark/ and defined $first_mark) {
        $second_mark = $_;
        print "Found valid set: ";
        print "$first_mark," . join(",", @workingset) . ",$second_mark\n";
        @workingset = ();
        $first_mark = $second_mark;
        undef $second_mark;
    }
}
__DATA__
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
Output:
Found valid set: rs6336442_mark 1E-07,rs31328150_impute 0.444121193,rs3658242_mark 0.444121293
Found valid set: rs3658242_mark 0.444121293,rs39342374_impute 0.444121393,IMP!1! 1,rs3677817_mark 1.986015679
Found valid set: rs3677817_mark 1.986015679,IMP!2! 2,SNP117_impute 2.685815665,IMP!3! 3,SNP3_1_impute 3.643119709,SNP1_impute 3.643119809,rs13475706_mark 3.643119909
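Since the goal is to hand each subset to a subroutine rather than print it, the same state machine can be wrapped so that it returns the subsets. This is a sketch under assumptions. The name collect_subsets and the shortened sample names are made up, and it treats every line between two _mark lines as part of the subset, as the question describes.

```perl
use strict;
use warnings;

# collect_subsets is a hypothetical wrapper around the same state machine.
# It returns a list of array references, one per subset, so each subset can
# be passed to another subroutine instead of being printed.
sub collect_subsets {
    my @lines = @_;
    my ( $first_mark, @between, @subsets );
    for my $line (@lines) {
        if ( $line =~ /_mark/ ) {
            # A mark closes a subset only if something followed the previous mark.
            push @subsets, [ $first_mark, @between, $line ] if @between;
            $first_mark = $line;
            @between    = ();
        }
        elsif ( defined $first_mark ) {
            push @between, $line;
        }
    }
    return @subsets;
}

# Made-up sample names, shaped like the real data.
my @subsets = collect_subsets(
    'a_mark 0', 'b_mark 1', 'c_impute 2', 'd_mark 3',
    'IMP!1! 4', 'e_impute 5', 'f_mark 6',
);
print scalar(@subsets), " subsets\n";
print join( ',', @$_ ), "\n" for @subsets;
```

Each closing mark is reused as the opening mark of the next subset, which gives the overlap described in the question's EDIT.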

Read CSV file and save in 2 d array

I am trying to read a huge CSV file into a 2D array. There must be a better way to split each line and save it in the 2D array in one step.
Cheers
my $j = 0;
while (<IN>)
{
    chomp ;
    my @cols=();
    @cols = split(/,/);
    shift(@cols) ; #to remove the first number which is a line header
    for(my $i=0; $i<11; $i++)
    {
        $array[$i][$j] = $cols[$i];
    }
    $j++;
}
CSV is not trivial. Don't parse it yourself. Use a module like Text::CSV, which will do it correctly and fast.
use strict;
use warnings;
use Text::CSV;
my @data; # 2D array for CSV data
my $file = 'something.csv';
my $csv = Text::CSV->new;
open my $fh, '<', $file or die "Could not open $file: $!";
while( my $row = $csv->getline( $fh ) ) {
    shift @$row; # throw away first value
    push @data, $row;
}
That will get all your rows nicely in @data, without worrying about parsing CSV yourself.
If you ever find yourself reaching for the C-style for loop, then there's a good chance that your program design can be improved.
while (<IN>) {
    chomp;
    my @cols = split(/,/);
    shift(@cols); #to remove the first number which is a line header
    push @array, \@cols;
}
This assumes that you have a CSV file that can be processed with a simple split (i.e. the records contain no embedded commas).
Aside: You can simplify your code with:
my @cols = split /,/;
Your assignment to $array[$col][$row] uses an unusual subscript order; it complicates life.
With your column/row assignment order in the array, I don't think there's a simpler way to do it.
Alternative:
If you were to reverse the order of the subscripts in the array ($array[$row][$col]), you could think about using:
use strict;
use warnings;
my @array;
for (my $j = 0; <>; $j++) # For testing I used <> instead of <IN>
{
    chomp;
    $array[$j] = [ split /,/ ];
    shift @{$array[$j]}; # Remove the line label
}
for (my $i = 0; $i < scalar(@array); $i++)
{
    for (my $j = 0; $j < scalar(@{$array[$i]}); $j++)
    {
        print "array[$i,$j] = $array[$i][$j]\n";
    }
}
Sample Data
label1,1,2,3
label2,3,2,1
label3,2,3,1
Sample Output
array[0,0] = 1
array[0,1] = 2
array[0,2] = 3
array[1,0] = 3
array[1,1] = 2
array[1,2] = 1
array[2,0] = 2
array[2,1] = 3
array[2,2] = 1
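If the column-first layout from the question ($array[$col][$row]) really is needed, one option is to read row by row as above and transpose afterwards. A minimal sketch, assuming the line labels have already been removed. The helper name transpose is illustrative.

```perl
use strict;
use warnings;

# transpose is an illustrative helper: it turns $rows[$r][$c] into $cols[$c][$r].
sub transpose {
    my @rows = @_;
    my @cols;
    for my $r ( 0 .. $#rows ) {
        for my $c ( 0 .. $#{ $rows[$r] } ) {
            $cols[$c][$r] = $rows[$r][$c];
        }
    }
    return @cols;
}

# Same values as the sample data above, with the line labels already removed.
my @rows = ( [ 1, 2, 3 ], [ 3, 2, 1 ], [ 2, 3, 1 ] );
my @cols = transpose(@rows);
print "column $_: @{ $cols[$_] }\n" for 0 .. $#cols;
```

This keeps the reading loop simple and confines the unusual subscript order to one place.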
