Add all values in array for each ID in Perl - arrays

I have this table:
NAME |12/31/2016|VALUE
AAA |1/31/2017 |10
AAA |2/1/2017 |20
AAA |2/2/2017 |30
AAA |2/3/2017 |40
AAA |2/4/2017 |50
NAME |2/9/2017 |VALUE
BBB |2/10/2017 |20
BBB |2/11/2017 |30
BBB |2/12/2017 |40
BBB |2/13/2017 |50
BBB |2/14/2017 |60
and this would be my desired output:
NAME |DATE |VALUE
AAA |12/31/2016 |150
AAA |1/31/2017 |140
AAA |2/1/2017 |120
NAME |DATE |VALUE
BBB |2/9/2017 |200
BBB |2/10/2017 |180
BBB |2/11/2017 |150
What I want to do is, for each of the valid symbols, (AAA, BBB) I want to have three rows.
For the first row of each column, I want all the values added,
For example, row 1 value for AAA:
10+20+30+40+50 = 150
then for row 2 I want to just add from the second value to the last.
For example row 2 value for AAA
20+30+40+50 = 140
and so on same goes for BBB.
I want to shift the dates down so that 12/31/2016 would match AAA, then get the first three dates for each row.
I currently have this code. but this doesn't do much. it just gives me a bunch of numbers.
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Data::Dumper;
sub uniq {
my %seen;
grep !$seen{$_}++, #_;
}
my %cashflow;
my %fields = (
ID => 0,
DATES => 1,
VALUE => 2,
);
my #total;
my #IDs;
my #uniqueIDs;
my #dates;
my #add;
my $i = 0;
my #values;
my $counter = 3;
open( FILE, "try.CSV" );
while ( my $line = <FILE> ) {
chomp( $line );
my #lineVals = split( /\|/, $line );
if ( $lineVals[ $fields{ID} ] !~ /^SYMBOL$/i ) {
push #IDs, $lineVals[ $fields{ID} ];
}
#uniqueIDs = uniq( #IDs );
#push all CASH FLOW AMOUNTS to #cashflow
if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
$lineVals[ $fields{VALUE} ] =~ s/\r//;
push #total, $lineVals[ $fields{VALUE} ];
}
if ( $lineVals[ $fields{DATES} ] =~ /(\d{1,2})\/(\d{1,2})\/(\d{4})/ ) {
$lineVals[ $fields{DATES} ] = sprintf( '%04d%02d%02d', $3, $2, $1 );
}
$cashflow{ uc $lineVals[ $fields{ID} ] }{DATES} = $lineVals[ $fields{DATES} ];
$cashflow{ uc $lineVals[ $fields{ID} ] }{VALUE} = $lineVals[ $fields{VALUE} ];
foreach my $ID ( #uniqueIDs ) {
foreach my $symb ( keys %cashflow ) {
if ( $ID = $symb ) {
if ( looks_like_number( $lineVals[ $fields{VALUE} ] ) ) {
$lineVals[ $fields{VALUE} ] =~ s/\r//;
push #total, $lineVals[ $fields{VALUE} ];
my $i = 0;
my $grand = 0;
foreach my $val ( #total ) {
while ( $i < $counter ) {
$grand += $val;
print "$grand \n";
$i++;
}
shift #total;
}
}
}
}
}
}
close FILE;
I'm really stuck with this. I don't know what to do with the problem.

A possible solution:
#!perl
use strict;
use warnings;
sub trim {
my ($str) = #_;
s!\A\s+!!, s!\s+\z!! for $str;
$str
}
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
my ($group_name, #dates, #values);
my $sum = 0;
my $print_group = sub {
return if !defined $group_name;
my $format = " %-6s|%-11s|%s\n";
printf $format, 'NAME', 'DATE', 'VALUE';
for my $date (#dates) {
printf $format, $group_name, $date, $sum;
$sum -= shift #values if #values;
}
};
while (my $line = readline $fh) {
my ($name, $date, $value) = map trim($_), split /\|/, $line;
if ($name eq 'NAME') {
$print_group->();
$group_name = undef;
#dates = $date;
#values = ();
$sum = 0;
next;
}
$group_name ||= $name;
push #dates, $date if #dates < 3;
push #values, $value if #values < 2;
$sum += $value;
}
$print_group->();
Let's go over it.
sub trim {
my ($str) = #_;
s!\A\s+!!, s!\s+\z!! for $str;
$str
}
A helper function for removing leading/trailing whitespace from a string. We're using ! as the s delimiter here because / breaks SO's syntax highlighting. Shrug.
my $file = 'try.CSV';
open my $fh, '<', $file or die "$0: $file: $!\n";
Open our input file. Note: We use a lexical variable ($fh) instead of a bareword filehandle, and we use 3-argument open. This is strongly recommended. We also check open's return value and produce a nice error message in case of failure, including both the name of the file that couldn't be opened ($file) and the reason for failing ($!).
my ($group_name, #dates, #values);
my $sum = 0;
We set up some state variables that we want to preserve across loop iterations. $group_name is the name of the group we're currently processing, #dates is the saved dates we've seen so far, #values is the saved values we've seen so far. $sum is a running sum of all the values in the current group, and it starts at 0.
my $print_group = sub {
return if !defined $group_name;
my $format = " %-6s|%-11s|%s\n";
printf $format, 'NAME', 'DATE', 'VALUE';
for my $date (#dates) {
printf $format, $group_name, $date, $sum;
$sum -= shift #values if #values;
}
};
A helper function for printing the output for a single group. If $group_name isn't set, we haven't processed any input for the current group yet, so we do nothing and return. Otherwise we print a NAME | DATE | VALUE header, followed by a row of data for each element in #dates. For each $date we output the current group name (e.g. AAA), $date, and the sum of values (all nicely formatted using printf). Initially $sum is the sum of all group values, but after the first iteration we start subtracting the values from #values: If the list of values in the input was x1, x2, x3, x4, ..., then $sum is initially x1 + x2 + x3 + x4 + ..., and that's what's printed in the first line of output. After that we subtract x1, so the next line gets x1 + x2 + x3 + x4 + ... - x1, which is x2 + x3 + x4 + .... After that we subtract x2, so the third row of data gets x3 + x4 + ....
while (my $line = readline $fh) {
my ($name, $date, $value) = map trim($_), split /\|/, $line;
Our main loop. We read a line of input, split it on |, and trim each field.
if ($name eq 'NAME') {
$print_group->();
$group_name = undef;
#dates = $date;
#values = ();
$sum = 0;
next;
}
If $name is 'NAME', this is the start of a new group. Print the output for the current group if any ($print_group->() does nothing if there is no current group), then reset our state variables back to initial values, except for #dates, which is filled with the $date value from the header row. Then start the next iteration of the loop because we're done with this line.
$group_name ||= $name;
push #dates, $date if #dates < 3;
push #values, $value if #values < 2;
$sum += $value;
If we get here, this line is not the start of a new group. We set $group_name if it hasn't been set yet. We add $date to our list of saved dates (but we only need 3 dates, so do nothing if we already have 3). We add $value to our list of saved values (but we only need 2 of them). Finally we add $value to our total $sum within the group.
}
$print_group->();
At the end of the loop we've also just finished processing a group, so we need to call $print_group here as well.

This will do as you ask. It reads the whole data file into an array of arrays and manipulates that array before printing it. The blocks are processed backwards from the end so that the other blocks remain in place when the trailing lines are deleted
This program expects the path to the input file as a parameter on the command line and writes the result to STDOUT
use strict;
use warnings 'all';
my #data = map [ /[^|\s]+/g ], <>;
# Make a list of the indices of all the header rows
my #headers = grep { $data[$_][0] eq 'NAME' } 0 .. $#data;
# Make a list of the indices of the first
# and last lines of all the data blocks
my #blocks = map {
[
$headers[$_] + 1,
$_ == $#headers ? $#data : $headers[$_+1] - 1
]
} 0 .. $#headers;
# Shift the second column down
# Replace the col2 header with 'DATE'
#
$data[$_][1] = $data[$_-1][1] for reverse 1 .. $#data;
$data[$_][1] = 'DATE' for #headers;
# Edit each block of data
#
for my $block ( reverse #blocks ) {
my ( $beg, $end ) = #$block;
# Calculate the block total
my $total = 0;
for ( $beg ... $end ) {
$total += $data[$_][2];
}
# Calculate the first three data values
for my $i ( $beg .. $beg + 2 ) {
my $next = $total - $data[$i][2];
$data[$i][2] = $total;
$total = $next;
}
# Remove everything except those three lines
splice #data, $beg+3, $end-$beg-2;
}
print join('|', #$_), "\n" for #data;
output
NAME|DATE|VALUE
AAA|12/31/2016|150
AAA|1/31/2017|140
AAA|2/1/2017|120
NAME|DATE|VALUE
BBB|2/9/2017|200
BBB|2/10/2017|180
BBB|2/11/2017|150

Related

Array of array to manage a computing among non-adjacent lines in Perl

I'm handling some data in order to calculate a mean over samples and, when having a large difference (values over mean +/- 2 standard deviations), make an interpolation, but I need a strong help!
The hard part is to select the rows for which making the calculation respect the sample they belong to. The sample identifier is placed in the first or second or third column (the third column is not present in the example below) in a dependecy logic that is not easy to manage.
My data looks like this (directly copied and pasted from my file).
The input file is also available at https://gofile.io/?c=3PLR8m.
Columns are tab separated and each identifier has a space before the characters.
ENTITY-CODE XX YY ZZ AA BB CC att 1
/P1
^/A1/S1 143.07 124.05 -159.24 -160.53 0.39 3.31 15
^<S2 143.45 123.69 -157.19 -160.74 0.43 1.5 14.8
+A1/S1 143.87 122.84 -157.08 -147.56 -30.37 3.07 4.9
^<S2 152.09 120.29 -155.42 -145.61 -67.13 0.37 3.3
^<S3 161.5 120.13 -153.34 -134.92 -73.39 -3.93 3.4
^<S4 27.76 122.15 -152.59 -103.01 -74.37 -20 2.9
^<S5 179.58 125.71 -153.46 -90.21 -73.6 -21.68 2.8
^<S6 189.23 128.85 -152.9 -86.28 -72.54 -19.89 2.4
^<S7 196.23 135.77 -152.82 -73.48 -75.22 -19.93 2.1
^<S8 195.49 147.85 -150.64 -63.59 -80.44 -32.27 1.5
^<S3 143.07 124.1 -157.05 -145.58 -1.81 6.34 16
+A1/S1 142.03 123.41 -156.23 -72.07 -19.45 -0.4 5.5
^<S2 134.29 121.27 -153.31 -76.28 -3.92 -2.37 3.8
^<S3 128.55 119.39 -152.31 -73.1 6.95 0.04 2.7
^<S4 120.87 115.88 -150.91 -69.62 8.05 0.63 2.7
^<S5 115.31 112.83 -151.31 -76.97 7.45 -2.31 2.4
^<S6 108.54 110.71 -149.38 -86.09 5.68 -6.48 1.5
^<S4 143.49 123.63 -155.79 -175.31 14.3 12.22 13.7
+A1/S1 143.5 124.75 -155.22 175.69 25.35 25.61 5.9
^<S2 145.63 130.57 -156.39 141.67 42.19 31.94 5.3
^<S3 153.77 131.23 -153.8 71.9 34.43 20.11 3.6
^<S4 160.99 132.18 -149.31 89.71 35.44 14.31 2.6
^<S5 166.86 133.6 -146.6 93.88 34.73 11.46 1.8
+A2/S1 143.63 122.79 -155.05 65.04 4.77 -16.93 3.5
^<S6 144.71 122.02 -151.41 56.49 -7.71 -16.1 2.8
^<S6 146.83 120.14 -148.52 61.14 24.37 48.58 2.9
^<S6 154.06 115.65 -149.29 60.87 20.18 13.8 2.5
^<S5 143.32 33.32 -153.16 -127.03 8.59 9.07 12.4
^<S6 143.49 121.69 -150.07 -127.26 9.04 10.85 12.5
Basically, the rows with an identifier that follows one containing "A" in the same column need to undergo the calculation (together with the row of the "A") to check for out-of-bound values as they belong to the same sample. If, in the same column, there is another identifier containing an "A" means that is starting another group of rows which belong to another sample and need to go under another calculation.
In the example that I post here I would like a script that, starting from the first ^/A1/S1 recognizes all the rows with an identifier in the first column and make the check on the XX, YY and ZZ values they have.
The script should do the same also if the +A1/S1 identifier is in the second or other columns.
In practice, everytime there is an identifier containing an "A" means that is starting a sample which futher elements have S-type identifiers in the same column (till another A-type identifier).
The number included in the S-type identifier is not relevant. Therefore, for example, the three rows which have the same identifier (next to the end of the input example) must be considered as three separate sets of values.
The format for the output should be the same of the input with the only difference of the interpolated values changed.
The interpolation should consist into the calculation of the mean and standard deviation over a sample (rows with the identifier in the same column, from one marked by an "A" till the last before another identifier in the same column having an "A") and a check if a value is over the mean +/- 2 standard deviations (meanĀ±(2*dev.st)). If so, a value is substituted with the sample mean.
In the example here I would like to obtain the same of the input except for: the XX value in the eighth row (27.76), which should be replaced by the mean calculated on the XX values from the rows of the same sample which are
the previous and next rows (which have ^<S3 and ^<S5 respectively as identifiers in the second column) and (ii) the YY value in the thirtieth row (33.32) which should be replaced by the mean calculated on the row which have ^<S4 and ^<S6 in the first column.
Thus, this is the output I desire.
ENTITY-CODE XX YY ZZ AA BB CC att 1
/P1
^/A1/S1 143.07 124.05 -159.24 -160.53 0.39 3.31 15
^<S2 143.45 123.69 -157.19 -160.74 0.43 1.5 14.8
+A1/S1 143.87 122.84 -157.08 -147.56 -30.37 3.07 4.9
^<S2 152.09 120.29 -155.42 -145.61 -67.13 0.37 3.3
^<S3 161.5 120.13 -153.34 -134.92 -73.39 -3.93 3.4
^<S4 173.59 122.15 -152.59 -103.01 -74.37 -20 2.9
^<S5 179.58 125.71 -153.46 -90.21 -73.6 -21.68 2.8
^<S6 189.23 128.85 -152.9 -86.28 -72.54 -19.89 2.4
^<S7 196.23 135.77 -152.82 -73.48 -75.22 -19.93 2.1
^<S8 195.49 147.85 -150.64 -63.59 -80.44 -32.27 1.5
^<S3 143.07 124.1 -157.05 -145.58 -1.81 6.34 16
+A1/S1 142.03 123.41 -156.23 -72.07 -19.45 -0.4 5.5
^<S2 134.29 121.27 -153.31 -76.28 -3.92 -2.37 3.8
^<S3 128.55 119.39 -152.31 -73.1 6.95 0.04 2.7
^<S4 120.87 115.88 -150.91 -69.62 8.05 0.63 2.7
^<S5 115.31 112.83 -151.31 -76.97 7.45 -2.31 2.4
^<S6 108.54 110.71 -149.38 -86.09 5.68 -6.48 1.5
^<S4 143.49 123.63 -155.79 -175.31 14.3 12.22 13.7
+A1/S1 143.5 124.75 -155.22 175.69 25.35 25.61 5.9
^<S2 145.63 130.57 -156.39 141.67 42.19 31.94 5.3
^<S3 153.77 131.23 -153.8 71.9 34.43 20.11 3.6
^<S4 160.99 132.18 -149.31 89.71 35.44 14.31 2.6
^<S5 166.86 133.6 -146.6 93.88 34.73 11.46 1.8
+A2/S1 143.63 122.79 -155.05 65.04 4.77 -16.93 3.5
^<S6 144.71 122.02 -151.41 56.49 -7.71 -16.1 2.8
^<S6 146.83 120.14 -148.52 61.14 24.37 48.58 2.9
^<S6 154.06 115.65 -149.29 60.87 20.18 13.8 2.5
^<S5 143.32 123.41 -153.16 -127.03 8.59 9.07 12.4
^<S6 143.49 121.69 -150.07 -127.26 9.04 10.85 12.5
It has only two changes respect the input:
in row 8 (that is marked with the identifier ^<S4 in the second column) the XX value 27.76 has been replaced by the mean computed over the XX values from rows 5 to 12 (which have +A1/S1, ^<S2, ^<S3, ^<S4, ^<S5, ^<S6, ^<S7, ^<S8 in the second column as identifiers);
in row 30 (that is marked with the identifier ^<S6 in the first column) the YY value 33.32 has been replaced by the mean computed over the YY values from the rows 3, 4, 13, 20, 30 and 31 (respectively marked with the identifiers ^/A1/S1, ^<S2, ^<S3, ^<S4, ^<S5, ^<S6 in the first column).
The code I wrote so far follows. I thought of array of arrays, but I'm not sure about how to set it up.
Any suggestion is superwelcome as I'm superstuck.
Thank you!
open (HAN, "<", "$file") || die "problems with the input file";
my #lines = ();
while (<HAN>) {
chomp;
push(#lines, $_); }
#print STDERR "#lines\n";
close (HAN);
for ($lines[$i] =0; $i<=$#lines; $i++){
#columns = split (/\t/, $lines[$i]);
#print STDERR "#columns\n";
my #p;
my #s;
if (( $columns[0] ne "" ) && ( $columns[1] eq "" )){
push #p, $lines[$i] ;
#print STDERR "#p\n";
} elsif (( $columns[0] eq "" ) && ( $columns[1] ne "" )){
push #s, $lines[$i] ;
#print STDERR "#s\n";
print STDERR "#s\n";
Sorry, I don't have more time to dedicate to this. Maybe the following can help you find the correct way.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use Syntax::Construct qw{ // };
use List::Util qw{ sum };
my #data;
while (<>) {
chomp;
push #data, [ split /\t/ ];
}
my (#dev_st, #mean);
for my $line_index (0 .. $#data) {
for my $column (2, 3) {
for my $level (0, 1) {
if (($data[$line_index][$level] // "") =~ /A/) {
my $to = $line_index;
my $inner_group;
do { ++$to } until $to > $#data
|| $level == 1 && $data[$to][0]
|| (($data[$to][$level] // "") =~ /A/
and $inner_group = 1);
--$to if $inner_group;
my #group_data = map $data[$_][2],
grep $data[$_][$level],
$line_index .. $to;
$mean[$level] = sum(#group_data) / #group_data;
$dev_st[$level] = sqrt(1/(#group_data - 1) * sum(
map { ($_ - $mean[$level]) ** 2 } #group_data));
# warn "$line_index: #group_data\n$mean[$level] $dev_st[$level]\n";
}
}
my $value = $data[$line_index][$column] // "";
next unless $value =~ /-?[0-9]+(?:\.[0-9]+)?/;
my ($level) = grep $data[$line_index][$_], 0, 1;
if ( $value > $mean[$level] + 2 * $dev_st[$level]
|| $value < $mean[$level] - 2 * $dev_st[$level]
) {
$data[$line_index][$column]
= sprintf '%.2f', $mean[$level];
}
}
say join "\t", map $_ // "", #{ $data[$line_index] };
}
print "\n";
Here is another example of how you could approach it:
package Main;
use feature qw(say);
use strict;
use warnings;
my $self = Main->new( fn => 'in.mtg', save_fn => 'out.mtg');
$self->read_file();
$self->calc_mean();
$self->calc_std();
$self->do_subst();
$self->write_file();
say "Done";
sub do_subst {
my ( $self ) = #_;
for my $i (0..2) {
my $var = $self->{vars}[$i];
my $mean = $self->{mean}[$i];
my $std = $self->{std}[$i];
for my $col_no (0..1) {
my $col = $self->{col}[$col_no];
$self->do_subst_col( $col, $var, $mean->[$col_no], $std->[$col_no] );
}
}
}
sub do_subst_col {
my ( $self, $col, $var, $mean, $std ) = #_;
my $mean_cur;
my $std_cur;
my $k = 0;
for my $i (0..$#$col) {
my $id = $col->[$i];
next if $id !~ /\S/; # No identifier in this row
if ( $id =~ /A/ ) {
$k++ if defined $mean_cur;
$mean_cur = $mean->[$k];
$std_cur = $std->[$k];
}
if ( ($var->[$i] < ($mean_cur - 2*$std_cur))
|| ($var->[$i] > ($mean_cur + 2*$std_cur)) ) {
$var->[$i] = $mean_cur;
}
}
}
sub calc_std {
my ( $self ) = #_;
my #std;
for my $i (0..2) {
push #std, $self->calc_std_var( $i );
}
$self->{std} = \#std;
}
sub calc_std_var {
my ( $self, $i ) = #_;
my $mean = $self->{mean}[$i];
my $var = $self->{vars}[$i];
my #std;
for my $col_no (0..1) {
my $col = $self->{col}[$col_no];
push #std, $self->calc_std_col( $col, $var, $mean->[$col_no] );
}
return \#std;
}
sub calc_std_col {
my ( $self, $col, $var, $mean ) = #_;
my #std;
my $sum;
my $N;
my $k = 0;
my $mean_cur = $mean->[$k];
for my $i (0..$#$col) {
my $id = $col->[$i];
next if $id !~ /\S/; # No identifier in this row
if ( $id =~ /A/ ) {
if (defined $sum) {
push #std, sqrt($sum/$N);
$k++;
}
$sum = 0; $N = 0;
$mean_cur = $mean->[$k];
}
$sum += ($var->[$i] - $mean_cur)**2;
$N++;
}
push #std, sqrt($sum/$N);
return \#std;
}
sub calc_mean {
my ( $self ) = #_;
my #mean;
for my $i (0..2) {
push #mean, $self->calc_mean_var( $i );
}
$self->{mean} = \#mean;
}
sub calc_mean_var {
my ( $self, $i ) = #_;
my $var = $self->{vars}[$i];
my #mean;
for my $col_no (0..1) {
my $col = $self->{col}[$col_no];
push #mean, $self->calc_mean_col( $col, $var );
}
return \#mean;
}
sub calc_mean_col {
my ( $self, $col, $var ) = #_;
my #mean;
my $sum;
my $N;
for my $i (0..$#$col) {
my $id = $col->[$i];
next if $id !~ /\S/; # No identifier in this row
if ( $id =~ /A/ ) {
push #mean, $sum/$N if defined $sum;
$sum = 0; $N = 0;
}
$sum += $var->[$i];
$N++;
}
push #mean, $sum/$N;
return \#mean;
}
sub new {
my ( $class, %args ) = #_;
my $self = bless \%args, $class;
return $self;
}
sub read_file {
my ( $self ) = #_;
my $fn = $self->{fn};
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $line = $self->read_header( $fh );
my #col1; my #col2; my #xx; my #yy; my #zz; my #rest;
while (1) {
chomp $line;
my #F = split "\t", $line;
die "Bad file." if #F != 9;
push #col1, $F[0];
push #col2, $F[1];
push #xx, $F[2];
push #yy, $F[3];
push #zz, $F[4];
push #rest, join "\t", #F[5..8];
$line = <$fh>;
last if !defined $line;
}
close $fh;
$self->{col} = [\#col1, \#col2];
$self->{vars} = [\#xx, \#yy, \#zz];
$self->{rest} = \#rest;
$self->{N} = scalar #col1;
}
sub read_header {
my ( $self, $fh ) = #_;
my $line;
my #header;
while (1) {
$line = <$fh>;
die "Bad file." if !defined $line;
last if $line =~ m{^ \^/A1/S1};
push #header, $line;
}
$self->{header} = \#header;
return $line;
}
sub write_file {
my ( $self ) = #_;
my $fn = $self->{save_fn};
open ( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
print {$fh} join "", #{ $self->{header} };
my $N = $self->{N};
my $col1 = $self->{col}[0];
my $col2 = $self->{col}[1];
my $xx = $self->{vars}[0];
my $yy = $self->{vars}[1];
my $zz = $self->{vars}[2];
my $rest = $self->{rest};
for my $i (0..($N - 1)) {
say {$fh} join "\t", $col1->[$i], $col2->[$i], $xx->[$i],
$yy->[$i], $zz->[$i], $rest->[$i];
}
close $fh;
}

Perl: Inserting values into specific columns of CSV file

I have CSV data of the form:
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
...
I need to reproduce the "label" to the left of "customer" columns marked with Y - and have nothing ("") to the left of columns marked with N.
Expected output:
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3,Customer3...
1,label1,label1,Y,"",N,label1,Y
2,label2,"",N,label2,Y,"",N
When opened using Excel, it would look like this:
S.No Label Customer1 Customer1 Customer2 Customer2 Customer3 Customer3...
1 label1 label1 Y N label1 Y
2 label2 N label2 Y N
The two leftmost columns, referring to S.No and the original "Label" column, are constant.
What is the simplest way to do this? I tried the following code:
use strict;
use warnings;
my $nonIncludesFile = "nonIncludes.csv";
open(my $xfh, "+>", $nonIncludesFile) or warn "Unable to open $nonIncludesFile, $!";
chomp( my $header = <$xfh> );
my #names = split ",", $header;
my #names1;
my #fields;
my #fields1;
for(my $j=0; $j< scalar(#names); $j++)
{
$names1[$j] = $names[$j];
}
while(<$xfh>)
{
my $nonIncLine = $_;
$nonIncLine = chomp($nonIncLine);
#fields = split ",", $nonIncLine;
next if $. == 1; #skip the first line
for(my $i = 0; $i < scalar(#fields) -2; $i++) #Number of "customers" = scalar(#fields) -2
{
$fields1[0] = $fields[0];
$fields1[1] = $fields[1];
if('Y' eq $fields[ $i + 2 ])
{
$fields1[$i+2] = 'Y';
substr(#fields1, $i + 1, 0, $fields[1]); #insert the label to the left - HERE
}
else
{
$fields1[$i+2] = 'N';
substr(#fields1, $i + 1, 0, "");
}
}
}
print $xfh #names1;
print $xfh #fields1;
close($xfh);
This however complains of "substr outside of string" at the line marked by "HERE".
What am I doing wrong? And is there any simpler (and better) way to do this?
Something like this maybe?
#!/usr/bin/perl
use strict;
use warnings;
#read the header row
chomp( my ( $sn, $label, #customers ) = split( /,/, <DATA> ) );
#double the 'customers' column headings (one is suffixed "_label")
print join( ",", $sn, $label, map { $_ . "_label", $_ } #customers ), "\n";
#iterate data
while (<DATA>) {
#strip trailing linefeed
chomp;
#extract fields with split - note breaks if you've quoted commas inline.
my ( $sn, $label, #row ) = split /,/;
print "$sn,$label,";
#iterate Y/N values, and either prints "Y" + label, or anything else + blank.
foreach my $value (#row) {
print join( ",", $value eq "Y" ? $label : "", $value ),",";
}
print "\n";
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3
1,label1,Y,N,Y
2,label2,N,Y,N
Assumes you don't have any fruity special characters (e.g. commas) in the fields, because it'll break if you do, and you might want to consider Text::CSV instead.
It is always much better to post some usable test data than write a something like this question
However, it looks like your data has no quoted fields or escaped characters, so it looks like you can just use split and join to process the CSV data
Here's a sample Perl program that fulfils your requirement. The example output uses your data as it is. Each line of data has to be processed backwards so that the insertions don't affect the indices of elements that are yet to be processed
use strict;
use warnings 'all';
use feature 'say';
while ( <DATA> ) {
chomp;
my #fields = split /,/;
for ( my $i = $#fields; $i > 1; --$i ) {
my $newval =
$. == 1 ? $fields[$i] :
lc $fields[$i] eq 'y' ? $fields[1] :
'';
splice #fields, $i, 0, $newval;
}
say join ',', #fields;
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
output
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3...,Customer3...
1,label1,label1,Y,,N,label1,Y
2,label2,,N,label2,Y,,N

Transform/pivot array in perl

Im stuck writing Perl code which transforms 2d array.
First column of the array is always date
Second column of the array is key that sorts.
Data is located in array "data" and is ordered by date and then key.
My situation should be understandable from the tables under. Unique values from the second column will be selected and later divided into columns header (green table)
It should work with and number of columns or dates/keys.
Structure before
Structure after
My code:
#creates filtered array of all the unique dates and its count
my #date = #{ $data->[0] };
my #filtDate = uniq #date;
my $countFiltDate = scalar #filtDate;
#unique list of keys
my #klic = #{ $data->[1] };
my #filtKlic = uniq #klic;
#orders filtered keys
#filtKlic = sort #filtKlic;
my $countFiltKlic = scalar #filtKlic;
#count of columns
my $columnsCount = scalar #{$data};
#test code - counts how many new number columns to make.
my $columnsCountAfter = ( $columnsCount - 2 ) * $countFiltKlic;
#inserst filtered dates into first column
my $dataGraph;
for ( my $i = 0; $i < $countFiltDate; $i++ ) {
$dataGraph->[0]->[$i] = #filtDate[$i];
}
#biggest loop with number of dates
for ( my $k = 0; $k < $countFiltDate; $k++ ) {
my $l;
my $c;
#columns sount k $i
for ( my $i = 0; $i < $columnsCount - 2; $i++ ) {
#loop for different keys k $j
for ( my $j = 0; $j < $countFiltKlic; $j++ ) {
$l++; #riadok v prvej tabulke
#EVERYTHING after this part is written horibly.
# I'm trying to make it work even
#if key values are missing.
for ( my $m = 0; $m < 5; $m++ ) {
if ( $data->[1]->[ $l - 1 ] eq $filtKlic[$j] ) {
print " [" . $data->[1]->[ ( $l - 1 ) ] . ',' . $filtKlic[$j] . "]";
$dataGraph->[ $l + $c ]->[$k] = $data->[ $i + 2 ]->[ ( ( $k * $countFiltKlic ) + $j ) ];
#print " [".$data->[1]->[$j].','.($filtKlic[($j)])."]-";
print " [" . ( $i + 2 ) . ',' . ( ( $k * $countFiltKlic ) + $j ) . "]-";
print " [" . ( $l + $c ) . ',' . $k . "]<br>";
$m = 5; #just random number... i don't want to get infinite loops during testing
} else {
if ( $m = 5 ) {
$l--;
$c++;
}
$j++;
}
}
}
}
}
my #nameStlpceKlic;
#nameStlpceKlic[0] = "date";
my $o;
for ( my $i = 0; $i < $columnsCount - 2; $i++ ) {
foreach (#filtKlic) {
my $o;
$o++;
#nameStlpceKlic[$o] = #filtKlic[ ( $o - 1 ) ] . $i;
}
}
I have 2 problems.
How to make sure that this will work even if some of the key are missing at some dates.
How to write it properly. My code is too clumsy.
Here is my general approach for solving this kind of problem.
In the second table, you're grouping your data by the date, then displaying the values for number1 and the values for number2. This should give you a hint as to how you want to organise your data structure and what you need to index for printing.
Your current data is (I assume) stored in an array of arrays. I was too lazy to copy the values, so I made my own AoA with made up values. I've sprinkled comments through the code so you can see how I worked on this.
my $arr = [
['date','key','number1','number2'],
['22.12.2013','1a','1a1-34567','1a2-1234567'],
['22.12.2013','2b','2b1-3249871','2b2-4597134'],
['22.12.2013','3c','3c1-1234567',''],
['22.12.2013','4d','4c1-3249871','4c2-4597134'],
['22.13.2013','1a','1a1-34567','1a2-1234567'],
['22.13.2013','2b','','2b2-4597134'],
['22.13.2013','3c','3c1-1234567','3c2-1234567'],
['22.13.2013','4d','4c1-3249871','4c2-4597134'],
];
# first, remove the first row, which contains the column headers.
my $col_h = shift #$arr;
my $data;
my $key_list;
foreach (#$arr) {
my %hash;
# use a hash slice with the column header array as keys
# and the array as the values
#hash{#$col_h} = #$_;
# store this hash in a data hash indexed by date then key
$data->{ $hash{date} }{ $hash{key} } = \%hash;
# compile a separate hash with the keys in it
$key_list->{ $hash{key} }++;
}
# make a sorted list of keys, ready for printing
my #key_list = sort keys %$key_list;
# remove the first two elements from the column headers ('date' and 'key')
splice(#$col_h, 0, 2);
# print out the header row for the table (I'm doing a simple tab-delim'd table)
print STDERR "Date\t";
# for each NUMBER from NUMBER1 ... NUMBERn
foreach my $c (#$col_h) {
# print "keyID NUMBERn"
map { print STDERR "$_ $c\t" } #key_list;
}
print STDERR "\n";
# Now print out the data itself. Sort by date...
foreach my $date (sort keys %$data) {
print STDERR "$date\t";
# for each NUMBER1 ... NUMBERn
foreach my $header (#$col_h) {
foreach my $key (#key_list) {
## print out the value OR - if there is no value
print STDERR ( $data->{$date}{$key}{$header} || "-" ) . "\t";
}
}
print STDERR "\n"; # end of the table row
}
Output (with tabs expanded for display purposes):
Date 1a number1 2b number1 3c number1 4d number1 1a number2 2b number2 3c number2 4d number2
22.12.2013 1a1-34567 2b1-3249871 3c1-1234567 4c1-3249871 1a2-1234567 2b2-4597134 - 4c2-4597134
22.13.2013 1a1-34567 - 3c1-1234567 4c1-3249871 1a2-1234567 2b2-4597134 3c2-1234567 4c2-4597134
I was able to put together code that works using great answer from "i alarmed alien" .
First thing that is different is that my data are formatted as array of arrays in transposed way.
$arr1 = [ '2013-12-22', '2013-12-22' ];
$arr2 = [ 'Number1','Number2'];
$arr3 = [ '2328942', '679204'];
$arr4 = [ '1450398', '436713'];
Also transformed data should be saved in an array. I've written this piece of code. ( It's far from perfect, if there are any suggestions how to improve it further I'd be happy to hear those.)
####################
#transpose data
my $datas = $args{DATA};
my $headers = $args{HEADERS};
my #rows = ();
my #transposed = ();
for my $row (#$datas) {
for my $column (0 .. $#{$row}) {
push(#{$transposed[$column]}, $row->[$column]);
}
}
#################################
my #arr = #transposed;
# first, define headers.
my $col_h = $args{HEADERS};
my $data;
my $key_list;
foreach (#arr) {
my %hash;
# use a hash slice with the column header array as keys
# and the array as the values
#hash{#$col_h} = #$_;
# store this hash in a data hash indexed by date then key
$data->{ $hash{date} }{ $hash{key} } = \%hash;
# compile a separate hash with the keys in it
$key_list->{ $hash{key} }++;
}
# make a sorted list of keys, ready for printing
my #key_list = sort keys %$key_list;
# remove the first two elements from the column headers ('date' and 'key')
splice(#$col_h, 0, 2);
my #output;
my #header;
# print out the header row for the table (I'm doing a simple tab-delim'd table)
#print STDERR "Date\t";
push(#header, "Date\t");
# for each NUMBER from NUMBER1 ... NUMBERn
foreach my $c (#$col_h) {
# print "keyID NUMBERn"
map { push (#header,"$_ $c\t" )} #key_list;
#map { print STDERR "$_ $c\t" } #key_list;
}
#print STDERR "<br>";
push (#output,\#header );
my $row;
my $column;
# Now print out the data itself. Sort by date...
foreach my $date (sort keys %$data) {
#print STDERR "$date\t";
$row++;
my #line;
push(#line, "$date");
# for each NUMBER1 ... NUMBERn
foreach my $header (#$col_h) {
foreach my $key (#key_list) {
## print out the value OR - if there is no value
$column++;
push (#line,( $data->{$date}{$key}{$header} || "-" ) . "\t");
#print STDERR ( $data->{$date}{$key}{$header} || "-" ) . "\t";
}
}
print STDERR "<br>"; # end of the table row
$column = 0;
push (#output,\#line );
}
my $x = 1;
return #output;
}
This code works but it's little ugly. Please let me know If there is cleaner/better way to do this.

Pulling out potentially overlapping subsets of elements in array to make smaller arrays

My input file looks like below (real one is much larger):
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
13 lines, two elements each line. First element is a name. Each name ends either with a "tag" _mark or impute, or there is no tag. The point of the tag is to distinguish between types of names, which form the basis of my search for subsets within the entire list.
The subsets begin with a _mark name that immediately precedes an instance of an _impute name. The subsets end with the very next instance of _mark. All names in between, which will necessarily not have any such tag, also go into a subset, which I'd like to collect into an array and send off to a subroutine to process (details of that not important). Please note, the positions with IMP in the name are not the same as those actually tagged with a _impute.
For example, with the above, the first useable subset is:
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
The second useable subset is:
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
and so on... EDIT: Note that last _mark name of the first set is the first _mark name of the second.
My code for this:
#!/usr/bin/perl
use strict; use warnings;
my $usage = "usage: merge_impute.pl {genotype file} {distances file} \n";
die $usage unless #ARGV == 2;
my $genotypes = $ARGV[0];
open (FILE, "<$genotypes");
my #genotypes = <FILE>;
close FILE;
my $distances = $ARGV[1];
open (DISTS, "<$distances");
my #distances = <DISTS>;
close DISTS;
my #workingset = ();
#print scalar #distances;
for ( my $i = 0; $i < scalar #distances; $i++ ){
chomp $distances[$i];
#print "$distances[$i]\n";
if ( $distances[$i] =~ m/impute/ ){
push ( #workingset,$distances[$i-1],$distances[$i],$distances[$i+1]);
}
print "i=$i: #workingset\n";
# at this point send off to sub routine
#workingset=();
}
As you can see, the if loop is only set up to find subsets that contain only one _impute name. How can I modify the code so that a subset will "fill up" with as many names as required until we arrive at the next _mark name?
EDIT: Perhaps instead of the for loop, I could something like...
push (#workingset, $distances[0], $distances[1] );
until ( $distance[ ??? ] =~ m/_mark/ ){
push ( #workingset, $distance[ ??? ] );
}
But what could $distances[ ??? ] be?
EDIT: Or an alternative for loop...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
# send #workingset to sub routine
#clear workingset
#workingset = ();
}
}
Though this isn't working.
I also tried...
push (#workingset, $distances[0] );
for ( my $i = 1; $i < scalar #distances - 1 ; $i++ ){
until ( $distances[ $i ] =~ m/_mark/ ){
push ( #workingset, $distances[ $i ] );
next if $distances[ $i+1 ] !~ /_mark/;
}
# send #workingset to sub routine here
print "i=$i, #workingset\n\n";
#clear workingset
#workingset = ();
}
I don't have a lot of time right now but I'll hopefully have some time in the morning to check back. Here's a quick example on how you could do it (it is meant to be simple and easy to understand, not fancy). Hopefully it helps you get on the right track for parsing the data.
use strict;
use warnings;
my $first_mark;
my #workingset = ();
my $second_mark;
while (<DATA>){
chomp;
if ( /_mark/ and scalar #workingset == 0 ) {
$first_mark = $_;
} elsif ( /IMP|_impute/ and defined $first_mark) {
push #workingset, $_;
} elsif ( /_mark/ and defined $first_mark) {
$second_mark = $_;
print "Found valid set: ";
print "$first_mark," . join(",", #workingset) . ",$second_mark\n";
#workingset = ();
$first_mark = $second_mark;
undef $second_mark;
}
}
__DATA__
rs3683945_mark 0
rs6336442_mark 1E-07
rs31328150_impute 0.444121193
rs3658242_mark 0.444121293
rs39342374_impute 0.444121393
IMP!1! 1
rs3677817_mark 1.986015679
IMP!2! 2
SNP117_impute 2.685815665
IMP!3! 3
SNP3_1_impute 3.643119709
SNP1_impute 3.643119809
rs13475706_mark 3.643119909
Output:
Found valid set: rs6336442_mark 1E-07,rs31328150_impute 0.444121193,rs3658242_mark 0.444121293
Found valid set: rs3658242_mark 0.444121293,rs39342374_impute 0.444121393,IMP!1! 1,rs3677817_mark 1.986015679
Found valid set: rs3677817_mark 1.986015679,IMP!2! 2,SNP117_impute 2.685815665,IMP!3! 3,SNP3_1_impute 3.643119709,SNP1_impute 3.643119809,rs13475706_mark 3.643119909

Alter code to sum values from multiple files

Curious if I can get a little help here. I'm a perl newbie, and can't figure out how to convert the following code into something a bit more useful for my analysis.
This code presently takes the 1st and 4th column from a user supplied list of data files and puts them together.
What I'd like my code to do, for each row of the "current output" generated by this code (see below), is make a sum of these 4th column values (filea, fileb, filec). Not quite sure how to implement this...
Current Output:
filea fileb filec
entrya | 0 |10.2 | 0
entryb | 0 | 0.0 | 1
entryc | 8 | 57.0| 46
desired output
sum
entrya | 10.2
entryb | 1
entryc | 111
current code looks like this:
main: {
my %data;
foreach my $file (#rsem_files) {
open (my $fh, $file) or die "Error, cannot open file $file";
my $header = <$fh>; # ignore it
while (<$fh>) {
chomp;
my #x = split(/\t/);
my $acc = $x[0];
my $count = $x[4];
$data{$acc}->{$file} = $count;
}
close $fh;
}
my #filenames = #rsem_files;
foreach my $file (#filenames) {
$file = basename($file);
}
print join("\t", "", #filenames) . "\n";
foreach my $acc (keys %data) {
print "$acc";
foreach my $file (#rsem_files) {
my $count = $data{$acc}->{$file};
unless (defined $count) {
$count = "NA";
}
print "\t$count";
}
print "\n";
}
exit(0);
}
Alter the #rsemfiles loop:
# create $total variable outside loop
my $total = 0;
foreach my $file (#rsem_files) {
my $count = $data{$acc}->{$file};
# change unless to if, no need for NA
if (defined $count) {
$total += $count;
}
}
# move print outside loop so it happens once instead of per-file
print '\t$total\n';
foreach $line(#rsemfiles) {
if ($line=~ /^entry/) {
#match the line starting with the word entry
my $entry=$1; my $filea=$2; my $fileb=$3; my $filec=$4;
# make variables out of the column values
Now that you have these variables, you can do math on them.

Resources