Import variable time series data in SQL Server database - sql-server

I've looked for a solution to this here on Stack Overflow and elsewhere, but can't find examples that deal with the volume I have to work with. If I have missed a solution that has been posted elsewhere, I'd be very grateful if someone could point me in the right direction.
I'm trying to import time series data from 45 different Excel worksheets (about 5 per Excel workbook). Each worksheet contains commodity price series, covering several years of daily prices for each commodity.
The raw Excel data has one row for each day for which prices might exist and one column for each commodity contract, which typically is a monthly futures contract. Each contract therefore has at least 30 data points (but not many more), while the entire table has several thousand rows and 100+ columns.
I was able to build a SSIS package that reads the data and using unpivot, transforms the matrix into row based records with columns for:
Date, Price, Contract
The problem, however, is that in the unpivot transform I have to manually specify the destination column for each transformed input column. With 45 worksheets, each containing 100+ columns (some even several hundred) for contracts, I'd end up 'hard-coding' those transforms manually for the next few days... On top of that, this is not as flexible/re-usable as I was hoping.
Example of the raw data attached (the entire Cocoa worksheet contains 9724 rows and 195 columns)
Here's how the unpivot for another single commodity is configured. The 'Destination Column' has to be filled in manually row-by-row.
I'm hoping that I have just missed the right steps in the unpivot configuration to make these columns dynamic. Ideally the SSIS solution can be re-used with equally formatted Excel workbooks later. It's not required to run this on the server, as it's not a frequently recurring thing but rather happens once or twice per year at most, so I can easily kick this off manually from within VS.
I'm trying to help an academic researcher who would otherwise spend a massive amount of time cleaning and analysing the data manually in Excel.

First you need to design and create the tables that are going to receive the data and experiment with some manual data entry to check the data model.
Make sure each spreadsheet has enough header information to know how to process the rows.
When that is done, I would save the sheets to text files with tab delimiters.
Next I would write a loading program in Perl. It reads the header rows first and determines the rules for inserting the rows into the database. Then each row gets converted into an insert into the database.
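Such a loader's core unpivot step could be sketched like this. This is a minimal illustration, not the actual loading program; it assumes each sheet has been saved as a tab-delimited file whose first row is "Date" followed by the contract names, and `unpivot_tsv` is a hypothetical name:

```perl
# Minimal sketch of the unpivot step: turn a tab-delimited price matrix
# (row 1: Date<TAB>Contract1<TAB>Contract2...) into (date, contract, price)
# records, skipping blank cells (days with no price for that contract).
use strict;
use warnings;

sub unpivot_tsv {
    my ($path) = @_;
    open my $fh, '<', $path or die "$!: $path";
    chomp( my $header = <$fh> );
    my ( undef, @contracts ) = split /\t/, $header;    # drop the "Date" label
    my @records;
    while (<$fh>) {
        chomp;
        my ( $date, @prices ) = split /\t/, $_, -1;    # -1 keeps trailing empty cells
        for my $i ( 0 .. $#contracts ) {
            next unless defined $prices[$i] and length $prices[$i];
            push @records, [ $date, $contracts[$i], $prices[$i] ];
        }
    }
    close $fh;
    return @records;    # each element: [ Date, Contract, Price ]
}
```

The resulting rows can then be fed to whatever INSERT logic you prefer; no per-column configuration is needed, so the same script works for any number of contracts.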
Here is an example from an invoice loading program I own (all rights):
if ($first) {
    $obj->_hdr2keys(0);    # convert spreadsheet header into a lookup
    my $hdr = $obj->_copystruct($obj->{ar}[0]);
    my @Hhdr   = ('invoice header id');
    my @Hcols  = ('invhid');
    my @Htypes = ('serial');
    my @Dhdr   = ('invoice detail id');
    my @Dcols  = ('invdid','invhid');
    my @Dtypes = ('serial','integer');
    for (my $col=0; $col <= $#{$hdr}; $col++) {
        my $colname = lc($obj->_pomp($hdr->[$col]));
        if ($colname eq 'invoicenumber') {
            push @Hhdr, $hdr->[$col];
            push @Hcols, $colname;
            push @Htypes, 'char(32)';
        }
        elsif ($colname eq 'buysell') {
            push @Hhdr, $hdr->[$col];
            push @Hcols, $colname;
            push @Htypes, 'boolean';
        }
        elsif ($colname eq 'suppliercustomer') {
            push @Hhdr, $hdr->[$col];
            push @Hcols, $colname;
            push @Htypes, 'char(64)';
        }
        elsif ($colname eq 'date') {
            push @Hhdr, 'Transaction Date';
            push @Hcols, 'transactiondate';
            push @Htypes, 'date';
        }
        elsif ($colname eq 'article') {
            push @Dhdr, 'Article id';
            push @Dcols, 'artid';
            push @Dtypes, 'integer';
            push @Dhdr, 'Article Description';
            push @Dcols, 'description';
            push @Dtypes, 'char(64)';
        }
        elsif ($colname eq 'qty') {
            push @Dhdr, $hdr->[$col];
            push @Dcols, $colname;
            push @Dtypes, 'integer';
        }
        elsif ($colname eq 'priceexclbtw') {
            push @Dhdr, $hdr->[$col];
            push @Dcols, $colname;
            push @Dtypes, 'double precision';
        }
        elsif ($colname eq 'btw') {
            push @Dhdr, $hdr->[$col];
            push @Dcols, $colname;
            push @Dtypes, 'real';
        }
    }
    $obj->_getset('INVHar',
        ['invoiceheader',
         ['PK','invhid'],
         ['__COLUMNS__'],
         \@Hcols,
         \@Htypes,
         \@Hhdr
        ]
    );
    $obj->_getset('INVDar',
        ['invoicedetail',
         ['PK','invdid'],
         ['FK','invhid','invoiceheader','invhid'],
         ['FK','artid','article','artid'],
         ['__COLUMNS__'],
         \@Dcols,
         \@Dtypes,
         \@Dhdr
        ]
    );
}
$first = 0;
SALESROW: for (my $i=1; $i <= $#{$obj->{ar}}; $i++) {
    my @Hrow = ('');
    my @Drow = ('');
    my $date = $obj->_selectar('', $i, 'Date');
    $date =~ s/\-/\//g;
    if ($date) {
        $obj->_validCSV('date', $date)
            or die "CSV format error date |$date| in file $file";
    }
    my $invtotal = ($obj->_selectar('', $i, 'Invoice Total incl. BTW'));
    my $article = $obj->_selectar('', $i, 'Article');
    $date or $article or next SALESROW;
    if ($date) {
        push @Hrow, $obj->_selectar('', $i, 'Invoice Number');
        my $buysell = $obj->_selectar('', $i, 'Buy/Sell');
        push @Hrow, ($buysell eq 'S') ? 1 : 0;
        push @Hrow, $obj->_selectar('', $i, 'Supplier/Customer');
        push @Hrow, $date;
        push @{$obj->_getset('INVHar')}, \@Hrow;
        $invhid++;
    }
    push @Drow, $invhid;
    if ($article eq 'E0154') {
        push @Drow, 1;
    }
    elsif ($article eq 'C0154') {
        push @Drow, 2;
    }
    elsif ($article eq 'C0500') {
        push @Drow, 3;
    }
    elsif ($article eq 'C2000') {
        push @Drow, 4;
    }
    elsif ($article eq 'C5000') {
        push @Drow, 5;
    }
    else {
        die "unrecognised sales article $article\n"
            . Dumper($obj->{ar}[$i]);
    }
    push @Drow, undef;    # description is in the article table
    push @Drow, $obj->_selectar('', $i, 'Qty.');
    push @Drow, $obj->_selectar('', $i, 'Price excl. BTW');
    push @Drow, $obj->_selectar('', $i, 'BTW %');
    push @{$obj->_getset('INVDar')}, \@Drow;
}
This creates header and detail records for invoices after the product table has already been loaded from another spreadsheet.
In the above example two arrays of arrays are created, INVHar and INVDar. When they are ready, the calling routine loads them into the database as follows. In this next code example the tables are created as well as the rows, and a metadb is also updated for loading future tables and managing foreign keys for existing tables. The array created in the previous snippet contains all the information needed to create the table and insert the rows. There is also a simple routine, _DBdatacnv, that converts between the formats in the spreadsheet and the formats needed in the database; for example, the spreadsheet had currency symbols that need to be stripped before insertion.
sub _arr2db {
    my ($obj) = @_;
    my $ar    = $obj->_copystruct($obj->_getset('ar'));
    my $dbh   = $obj->_getset('CDBh');
    my $mdbh  = $obj->_getset('MDBh');
    my $table = shift @$ar;
    $mdbh->{AutoCommit} = 0;
    $dbh->{AutoCommit}  = 0;
    my @tables = $mdbh->selectrow_array(
        "SELECT id FROM mtables
         WHERE name = \'$table\'"
    );
    my $id = $tables[0] || '';
    if ($id) {
        $mdbh->do("DELETE FROM mcolumns where tblid=$id");
        $mdbh->do("DELETE FROM mtables where id=$id");
    }
    # process constraints
    my %constraint;
    while ($#{$ar} >= 0
           and $ar->[0][0] ne '__COLUMNS__') {
        my $cts  = shift @$ar;
        my $type = shift @$cts;
        if ($type eq 'PK') {
            my $pk = shift @$cts;
            $constraint{$pk} ||= '';
            $constraint{$pk} .= ' PRIMARY KEY';
            @$cts and die "unsupported compound key for $table";
        }
        elsif ($type eq 'FK') {
            my ($col, $ft, $fk) = @$cts;
            $ft && $fk or die "incomplete FK declaration in CSV for $table";
            $constraint{$col} ||= '';
            $constraint{$col} .=
                sprintf( ' REFERENCES %s(%s)', $ft, $fk );
        }
        elsif ($type eq 'UNIQUE') {
            while (my $uk = shift @$cts) {
                $constraint{$uk} ||= '';
                $constraint{$uk} .= ' UNIQUE';
            }
        }
        elsif ($type eq 'NOT NULL') {
            while (my $nk = shift @$cts) {
                $constraint{$nk} ||= '';
                $constraint{$nk} .= ' NOT NULL';
            }
        }
        else {
            die "unrecognised constraint |$type| for table $table";
        }
    }
    shift @$ar;
    unless ($mdbh->do("INSERT INTO mtables (name) values (\'$table\')")) {
        warn $mdbh->errstr . ": mtables";
        $mdbh->rollback;
        die;
    }
    @tables = $mdbh->selectrow_array(
        "SELECT id FROM mtables
         WHERE name = \'$table\'"
    );
    $id = shift @tables;
    $dbh->do("DROP TABLE IF EXISTS $table CASCADE")
        or die $dbh->errstr;
    my $create = "CREATE TABLE $table\n";
    my $cols   = shift @$ar;
    my $types  = shift @$ar;
    my $desc   = shift @$ar;
    my $first  = 1;
    my $last   = 0;
    for (my $i=0; $i<=$#{$cols}; $i++) {
        $last = 1;
        if ($first) {
            $first = 0;
            $create .= "( ";
        }
        else {
            $create .= ",\n";
        }
        $create .= $cols->[$i]
            . ' ' . $obj->_DBcnvtype($types->[$i]);
        $constraint{$cols->[$i]}
            and $create .= ' ' . $constraint{$cols->[$i]};
        unless ($mdbh->do("INSERT INTO mcolumns (tblid,name,type,description)
                           values ($id,\'$cols->[$i]\',\'$types->[$i]\',\'$desc->[$i]\')"))
        {
            warn $mdbh->errstr;
            $mdbh->rollback;
            die;
        }
    }
    $last and $create .= ')';
    unless ($dbh->do($create)) {
        warn $dbh->errstr;
        $dbh->rollback;
        die;
    }
    my $count = 0;
    while (my $row = shift @$ar) {
        $count++;
        my $insert = "INSERT INTO $table (";
        my $values = 'VALUES (';
        my $first  = 1;
        for (my $i=0; $i<=$#{$cols}; $i++) {
            my $colname = $cols->[$i];
            unless (defined($constraint{$colname})
                    and $constraint{$colname} =~ /PRIMARY KEY/) {
                if ($first) {
                    $first = 0;
                }
                else {
                    $insert .= ', ';
                    $values .= ', ';
                }
                $insert .= $colname;
                my $val = $obj->_DBdatacnv('CSV', 'DB',
                                           $types->[$i], $row->[$i]);
                if ($val eq '%ABORT') {
                    $mdbh->rollback;
                    die;
                }
                $values .= $val;
            }
        }
        $insert .= ')' . $values . ')';
        unless ($dbh->do($insert)) {
            warn $dbh->errstr;
            warn $insert;
            $mdbh->rollback;
            die;
        }
    }
    NOINSERT: $mdbh->commit;
    $dbh->commit;
    # warn "inserted $count rows into $table";
}
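The _DBdatacnv routine itself isn't shown above. Purely as a hypothetical illustration of the kind of conversion described (stripping currency symbols from numbers, quoting text for SQL), a sketch might look like this; `csv2db_value` is an invented name, not the author's routine:

```perl
# Hypothetical sketch of a CSV-to-DB value conversion in the spirit of
# _DBdatacnv: strip currency symbols/thousands separators from numeric
# types, quote character data, and map empty cells to NULL.
use strict;
use warnings;

sub csv2db_value {
    my ( $type, $raw ) = @_;
    return 'NULL' unless defined $raw and length $raw;
    if ( $type =~ /^(?:integer|real|double precision|serial)$/ ) {
        ( my $num = $raw ) =~ s/[^0-9.+-]//g;   # drop currency signs, commas, spaces
        return length $num ? $num : '%ABORT';   # signal an unparseable numeric cell
    }
    ( my $quoted = $raw ) =~ s/'/''/g;          # escape single quotes for SQL
    return "'$quoted'";
}
```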
Update: OK, I'll add the generic routine that converts from CSV to an array ready for _arr2db above, which covers all the other cases I have for a system. The spreadsheets are first augmented with PK, FK and other constraints, followed by a header of column names for the database, a row of the database types (notional; the actual types are taken care of in _DBcnvdatatype), then a row of tags that go in the metadatabase, and finally a __COLUMNS__ token just before the rows of data to insert.
sub _csv2arr {
    my ($obj, $csv) = @_;
    my $ar    = [];
    my $delim = $obj->_getset('csvdelim') || '\,';
    my $basename = basename($csv);
    $basename =~ s/\.txt$//;
    $ar = [$basename];
    open my $fh, $csv
        or die "$!: $csv";
    while (<$fh>) {
        chomp;
        my $sa = [];
        @$sa = split /$delim/;
        push @$ar, $sa;
    }
    close $fh;
    $obj->{ar} = $ar;
}

I would do this as a series of nested loops:
Loop 1: iterate through all files in the folder; pass the file name to the next loop
Loop 2: open the file, iterate through its sheets
Loop 3: in sheet X, loop through the columns A > 1
Loop 4: loop through the rows:
Read sheet X, row B
Get the values from (Row B, Column 1) as Date, (Row 1, Column A) as Product, and (Row B, Column A) as Price; write them to the destination
End Loop 4
(optional: at the end of the column, record some metadata about the number of rows)
End of Loop 3
(optional: record some metadata about the number of columns in the sheet)
End of Loop 2
(optional: record some metadata about the number of sheets in file X)
End of Loop 1
(strongly suggested: record some metadata about file X and the number of sheets/rows/columns - you can test a sample later for your own confidence)
You may want to modify a copy of one of your files so that you can test for issues.
sheet empty
invalid data
missing header
text instead of price
This will give more confidence and will shorten the rework that is required when you discover new edge cases.
Final output table should be the denormalised data in 3 columns:
Date,
Product,
Price
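The loop structure above can be sketched in Perl. This is only an illustration under some assumptions: each worksheet has first been exported to its own CSV file in a folder (a real workbook loop would need an Excel-reading module such as Spreadsheet::ParseXLSX), and `unpivot_folder` is an invented name:

```perl
# Sketch of the nested loops: iterate files in a folder, unpivot each
# date-by-product matrix to (Date, Product, Price) rows, and collect
# per-file metadata (row/column counts) for later sanity checks.
use strict;
use warnings;

sub unpivot_folder {
    my ($dir) = @_;
    my ( @out, %meta );
    for my $file ( sort glob("$dir/*.csv") ) {        # Loop 1: files
        open my $fh, '<', $file or die "$!: $file";
        chomp( my $header = <$fh> );
        my ( undef, @products ) = split /,/, $header; # row 1, columns A > 1
        my $rows = 0;
        while (<$fh>) {                               # loop over rows
            chomp;
            my ( $date, @prices ) = split /,/, $_, -1;
            $rows++;
            for my $col ( 0 .. $#products ) {         # loop over columns
                next unless length( $prices[$col] // '' );
                push @out, [ $date, $products[$col], $prices[$col] ];
            }
        }
        close $fh;
        $meta{$file} = { rows => $rows, columns => scalar @products };
    }
    return ( \@out, \%meta );    # denormalised rows plus per-file metadata
}
```

The metadata hash is what lets you spot-check a sample afterwards, as suggested above.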
EDIT
Here is a link that shows how you can dynamically loop through the columns of an Excel spreadsheet (and sheets) so that you can use this process to unpivot the data to a normalised form
Looping through Excel columns in SSIS

Related

Perl: Inserting values into specific columns of CSV file

I have CSV data of the form:
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
...
I need to reproduce the "label" to the left of "customer" columns marked with Y - and have nothing ("") to the left of columns marked with N.
Expected output:
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3,Customer3...
1,label1,label1,Y,"",N,label1,Y
2,label2,"",N,label2,Y,"",N
When opened using Excel, it would look like this:
S.No Label Customer1 Customer1 Customer2 Customer2 Customer3 Customer3...
1 label1 label1 Y N label1 Y
2 label2 N label2 Y N
The two leftmost columns, referring to S.No and the original "Label" column, are constant.
What is the simplest way to do this? I tried the following code:
use strict;
use warnings;

my $nonIncludesFile = "nonIncludes.csv";
open(my $xfh, "+>", $nonIncludesFile) or warn "Unable to open $nonIncludesFile, $!";
chomp( my $header = <$xfh> );
my @names = split ",", $header;
my @names1;
my @fields;
my @fields1;
for (my $j=0; $j < scalar(@names); $j++)
{
    $names1[$j] = $names[$j];
}
while (<$xfh>)
{
    my $nonIncLine = $_;
    $nonIncLine = chomp($nonIncLine);
    @fields = split ",", $nonIncLine;
    next if $. == 1;    # skip the first line
    for (my $i = 0; $i < scalar(@fields) - 2; $i++)    # number of "customers" = scalar(@fields) - 2
    {
        $fields1[0] = $fields[0];
        $fields1[1] = $fields[1];
        if ('Y' eq $fields[ $i + 2 ])
        {
            $fields1[$i+2] = 'Y';
            substr(@fields1, $i + 1, 0, $fields[1]);    # insert the label to the left - HERE
        }
        else
        {
            $fields1[$i+2] = 'N';
            substr(@fields1, $i + 1, 0, "");
        }
    }
}
print $xfh @names1;
print $xfh @fields1;
close($xfh);
This however complains of "substr outside of string" at the line marked by "HERE".
What am I doing wrong? And is there any simpler (and better) way to do this?
Something like this maybe?
#!/usr/bin/perl
use strict;
use warnings;

# read the header row
chomp( my ( $sn, $label, @customers ) = split( /,/, <DATA> ) );

# double the 'customers' column headings (one is suffixed "_label")
print join( ",", $sn, $label, map { $_ . "_label", $_ } @customers ), "\n";

# iterate data
while (<DATA>) {
    # strip trailing linefeed
    chomp;

    # extract fields with split - note: this breaks if you've quoted commas inline
    my ( $sn, $label, @row ) = split /,/;
    print "$sn,$label,";

    # iterate Y/N values, printing the label before each "Y" and a blank otherwise
    foreach my $value (@row) {
        print join( ",", $value eq "Y" ? $label : "", $value ), ",";
    }
    print "\n";
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3
1,label1,Y,N,Y
2,label2,N,Y,N
Assumes you don't have any fruity special characters (e.g. commas) in the fields, because it'll break if you do, and you might want to consider Text::CSV instead.
It is always much better to post some usable test data than a description like the one in this question.
However, it looks like your data has no quoted fields or escaped characters, so you can just use split and join to process the CSV data.
Here's a sample Perl program that fulfils your requirement. The example output uses your data as it is. Each line of data has to be processed backwards so that the insertions don't affect the indices of elements that are yet to be processed.
use strict;
use warnings 'all';
use feature 'say';

while ( <DATA> ) {
    chomp;
    my @fields = split /,/;
    for ( my $i = $#fields; $i > 1; --$i ) {
        my $newval =
            $. == 1               ? $fields[$i] :
            lc $fields[$i] eq 'y' ? $fields[1]  :
            '';
        splice @fields, $i, 0, $newval;
    }
    say join ',', @fields;
}
__DATA__
S.No,Label,Customer1,Customer2,Customer3...
1,label1,Y,N,Y
2,label2,N,Y,N
output
S.No,Label,Customer1,Customer1,Customer2,Customer2,Customer3...,Customer3...
1,label1,label1,Y,,N,label1,Y
2,label2,,N,label2,Y,,N

ID tracking while swapping and sorting other two arrays in perl

#! /usr/bin/perl
use strict;

my (@data, $data, @data1, @diff, $diff, $tempS, $tempE, @ID, @Seq, @Start, @End, @data2);
#my $file=<>;
open(FILE, "< ./out.txt");
while (<FILE>) {
    chomp $_;
    #next if ($line =~/Measurement count:/ or $line =~/^\s+/) ;
    #push @data, [split ("\t", $line)] ;
    my @data = split('\t');
    push(@ID,    $data[0]);
    push(@Seq,   $data[1]);
    push(@Start, $data[2]);
    push(@End,   $data[3]);
    # push @$data, [split ("\t", $line)] ;
}
close(FILE);
my %hash = map { my $key = "$ID[$_]"; $key => [ $Start[$_], $End[$_] ] } (0..$#ID);
for my $key ( %hash ) {
    print "Key: $key contains: ";
    for my $value ( $hash{$key} ) {
        print " $hash{$key}[0] ";
    }
    print "\n";
}
for (my $j=0; $j <= $#Start; $j++)
{
    if ($Start[$j] > $End[$j])
    {
        $tempS = $Start[$j];
        $Start[$j] = $End[$j];
        $End[$j] = $tempS;
    }
    print "$tempS\t$Start[$j]\t$End[$j]\n";
}
my @sortStart = sort { $a <=> $b } @Start;
my @sortEnd   = sort { $a <=> $b } @End;
#open(OUT,">>./trial.txt");
for (my $i=1521; $i>=0; $i--)
{
    print "hey";
    my $diff = $sortStart[$i] - $sortStart[$i-1];
    print "$ID[$i]\t$diff\n";
}
I have three arrays of the same length: ID with IDs (strings), and Start and End with integer values (read from a file).
I want to loop through all these arrays while keeping track of the IDs. First I swap elements between Start and End if Start > End; then I have to sort these two arrays for further processing (I am taking the difference of consecutive items in the sorted Start). While sorting, the positions of the values change, and as my IDs are unique for each Start and End element, how can I keep track of my IDs while sorting?
Three arrays, ID, Start and End, are under my consideration.
Here is a small chunk of my input data:
DQ704383 191990066 191990037
DQ698580 191911184 191911214
DQ724878 191905507 191905532
DQ715191 191822657 191822686
DQ722467 191653368 191653339
DQ707634 191622552 191622581
DQ715636 191539187 191539157
DQ692360 191388765 191388796
DQ722377 191083572 191083599
DQ697520 189463214 189463185
DQ709562 187245165 187245192
DQ540163 182491372 182491400
DQ720940 180753033 180753060
DQ707760 178340696 178340726
DQ725442 178286164 178286134
DQ711885 178250090 178250119
DQ718075 171329314 171329344
DQ705091 171062479 171062503
The columns above are ID, Start, and End respectively. If Start > End, I swapped them between those two arrays only. But after swapping, the descending order may change; I want the values kept in descending order along with their corresponding IDs, for the differencing explained above.
Don't use different arrays, use a hash to keep the related pieces of information together.
#!/usr/bin/perl
use warnings;
use strict;

use enum qw( START END );

my %hash;
while (<>) {
    my ($id, $start, $end) = split;
    $hash{$id} = [ $start < $end ? ($start, $end)
                                 : ($end, $start) ];
}
my @by_start = sort { $hash{$a}[START] <=> $hash{$b}[START] } keys %hash;
my @by_end   = sort { $hash{$a}[END]   <=> $hash{$b}[END]   } keys %hash;

use Test::More;
is_deeply(\@by_start, \@by_end, 'same');
done_testing();
Moreover, in the data sample you provided, the order of IDs is the same regardless of which value you sort by.

Splitting arrays in Perl

I am new to Perl.
I have the data below in an array; each product is a value of this array:
print @array
Product A
Date=2.3.13
Value=100
Time=2.12
Product B
Date=4.5.12
value=20
Description=Goods
Product C
...
...
These products can change from A to Z, and their attributes can change as well. Can someone help me get the data into the format below?
Product Date Value Descrip Time
A 2.3.13 100 -Nil- -Nil-
B 4.5.12 20 Goods
Based on additional input from the author and the comments, here's an updated answer.
Prepare a data structure (hash) to store data for one product.
Read in the file product by product, parsing consecutive lines. Store the data for one product in memory.
Print the data as soon as you are finished with one product (an empty line seems to be a separator). Clear the data. Repeat.
In total, this would look like this:
my $empty = {
    name  => "",
    date  => "",
    value => ""
    # etc.
};
my $product = { %$empty };    # start with a fresh copy
while (my $line = <$file>) {
    if ($line =~ m/^Product\s+(.*)$/) {
        $product->{name} = $1;
    } elsif ($line =~ m/^\s+Date=(\d+\.\d+\.\d+)$/) {
        $product->{date} = $1;
    } elsif () {
        # etc.
    } elsif ($line eq "") {
        # Empty line: print the finished product and clear the data
        print $product->{name} . "\t" . $product->{date} . "\n";    # etc.
        $product = { %$empty };    # fresh copy, not the same reference
    }
}

Transform/pivot array in perl

I'm stuck writing Perl code which transforms a 2D array.
The first column of the array is always a date.
The second column of the array is the key that sorts.
The data is located in the array "data" and is ordered by date and then key.
My situation should be understandable from the tables below. The unique values from the second column will be selected and later become the column headers (green table).
It should work with any number of columns or dates/keys.
Structure before
Structure after
My code:
# creates a filtered array of all the unique dates and its count
my @date = @{ $data->[0] };
my @filtDate = uniq @date;
my $countFiltDate = scalar @filtDate;

# unique list of keys
my @klic = @{ $data->[1] };
my @filtKlic = uniq @klic;

# orders filtered keys
@filtKlic = sort @filtKlic;
my $countFiltKlic = scalar @filtKlic;

# count of columns
my $columnsCount = scalar @{$data};

# test code - counts how many new number columns to make
my $columnsCountAfter = ( $columnsCount - 2 ) * $countFiltKlic;

# inserts filtered dates into the first column
my $dataGraph;
for ( my $i = 0; $i < $countFiltDate; $i++ ) {
    $dataGraph->[0]->[$i] = $filtDate[$i];
}

# biggest loop, over the number of dates
for ( my $k = 0; $k < $countFiltDate; $k++ ) {
    my $l;
    my $c;
    # column count: $i
    for ( my $i = 0; $i < $columnsCount - 2; $i++ ) {
        # loop over the different keys: $j
        for ( my $j = 0; $j < $countFiltKlic; $j++ ) {
            $l++;    # row in the first table
            # EVERYTHING after this part is written horribly.
            # I'm trying to make it work even
            # if key values are missing.
            for ( my $m = 0; $m < 5; $m++ ) {
                if ( $data->[1]->[ $l - 1 ] eq $filtKlic[$j] ) {
                    print " [" . $data->[1]->[ ( $l - 1 ) ] . ',' . $filtKlic[$j] . "]";
                    $dataGraph->[ $l + $c ]->[$k] = $data->[ $i + 2 ]->[ ( ( $k * $countFiltKlic ) + $j ) ];
                    #print " [".$data->[1]->[$j].','.($filtKlic[($j)])."]-";
                    print " [" . ( $i + 2 ) . ',' . ( ( $k * $countFiltKlic ) + $j ) . "]-";
                    print " [" . ( $l + $c ) . ',' . $k . "]<br>";
                    $m = 5;    # just a random number... I don't want infinite loops during testing
                } else {
                    if ( $m = 5 ) {
                        $l--;
                        $c++;
                    }
                    $j++;
                }
            }
        }
    }
}
my @nameStlpceKlic;
$nameStlpceKlic[0] = "date";
my $o;
for ( my $i = 0; $i < $columnsCount - 2; $i++ ) {
    foreach (@filtKlic) {
        my $o;
        $o++;
        $nameStlpceKlic[$o] = $filtKlic[ ( $o - 1 ) ] . $i;
    }
}
I have 2 problems:
How to make sure that this will work even if some of the keys are missing at some dates.
How to write it properly. My code is too clumsy.
Here is my general approach for solving this kind of problem.
In the second table, you're grouping your data by the date, then displaying the values for number1 and the values for number2. This should give you a hint as to how you want to organise your data structure and what you need to index for printing.
Your current data is (I assume) stored in an array of arrays. I was too lazy to copy the values, so I made my own AoA with made up values. I've sprinkled comments through the code so you can see how I worked on this.
my $arr = [
    ['date','key','number1','number2'],
    ['22.12.2013','1a','1a1-34567','1a2-1234567'],
    ['22.12.2013','2b','2b1-3249871','2b2-4597134'],
    ['22.12.2013','3c','3c1-1234567',''],
    ['22.12.2013','4d','4c1-3249871','4c2-4597134'],
    ['22.13.2013','1a','1a1-34567','1a2-1234567'],
    ['22.13.2013','2b','','2b2-4597134'],
    ['22.13.2013','3c','3c1-1234567','3c2-1234567'],
    ['22.13.2013','4d','4c1-3249871','4c2-4597134'],
];

# first, remove the first row, which contains the column headers
my $col_h = shift @$arr;

my $data;
my $key_list;
foreach (@$arr) {
    my %hash;
    # use a hash slice with the column header array as keys
    # and the array as the values
    @hash{@$col_h} = @$_;
    # store this hash in a data hash indexed by date then key
    $data->{ $hash{date} }{ $hash{key} } = \%hash;
    # compile a separate hash with the keys in it
    $key_list->{ $hash{key} }++;
}

# make a sorted list of keys, ready for printing
my @key_list = sort keys %$key_list;

# remove the first two elements from the column headers ('date' and 'key')
splice(@$col_h, 0, 2);

# print out the header row for the table (I'm doing a simple tab-delim'd table)
print STDERR "Date\t";
# for each NUMBER from NUMBER1 ... NUMBERn
foreach my $c (@$col_h) {
    # print "keyID NUMBERn"
    map { print STDERR "$_ $c\t" } @key_list;
}
print STDERR "\n";

# Now print out the data itself. Sort by date...
foreach my $date (sort keys %$data) {
    print STDERR "$date\t";
    # for each NUMBER1 ... NUMBERn
    foreach my $header (@$col_h) {
        foreach my $key (@key_list) {
            ## print out the value OR - if there is no value
            print STDERR ( $data->{$date}{$key}{$header} || "-" ) . "\t";
        }
    }
    print STDERR "\n";    # end of the table row
}
Output (with tabs expanded for display purposes):
Date 1a number1 2b number1 3c number1 4d number1 1a number2 2b number2 3c number2 4d number2
22.12.2013 1a1-34567 2b1-3249871 3c1-1234567 4c1-3249871 1a2-1234567 2b2-4597134 - 4c2-4597134
22.13.2013 1a1-34567 - 3c1-1234567 4c1-3249871 1a2-1234567 2b2-4597134 3c2-1234567 4c2-4597134
I was able to put together working code based on the great answer from "i alarmed alien".
The first difference is that my data are formatted as an array of arrays in a transposed way.
$arr1 = [ '2013-12-22', '2013-12-22' ];
$arr2 = [ 'Number1','Number2'];
$arr3 = [ '2328942', '679204'];
$arr4 = [ '1450398', '436713'];
Also, the transformed data should be saved in an array. I've written this piece of code. (It's far from perfect; if there are any suggestions on how to improve it further, I'd be happy to hear them.)
####################
# transpose data
my $datas   = $args{DATA};
my $headers = $args{HEADERS};
my @rows = ();
my @transposed = ();
for my $row (@$datas) {
    for my $column (0 .. $#{$row}) {
        push(@{$transposed[$column]}, $row->[$column]);
    }
}
#################################
my @arr = @transposed;

# first, define headers
my $col_h = $args{HEADERS};

my $data;
my $key_list;
foreach (@arr) {
    my %hash;
    # use a hash slice with the column header array as keys
    # and the array as the values
    @hash{@$col_h} = @$_;
    # store this hash in a data hash indexed by date then key
    $data->{ $hash{date} }{ $hash{key} } = \%hash;
    # compile a separate hash with the keys in it
    $key_list->{ $hash{key} }++;
}

# make a sorted list of keys, ready for printing
my @key_list = sort keys %$key_list;

# remove the first two elements from the column headers ('date' and 'key')
splice(@$col_h, 0, 2);

my @output;
my @header;
# build the header row for the table (a simple tab-delimited table)
#print STDERR "Date\t";
push(@header, "Date\t");
# for each NUMBER from NUMBER1 ... NUMBERn
foreach my $c (@$col_h) {
    # push "keyID NUMBERn"
    map { push(@header, "$_ $c\t") } @key_list;
    #map { print STDERR "$_ $c\t" } @key_list;
}
#print STDERR "<br>";
push(@output, \@header);

my $row;
my $column;
# Now build the data itself. Sort by date...
foreach my $date (sort keys %$data) {
    #print STDERR "$date\t";
    $row++;
    my @line;
    push(@line, "$date");
    # for each NUMBER1 ... NUMBERn
    foreach my $header (@$col_h) {
        foreach my $key (@key_list) {
            ## push the value OR - if there is no value
            $column++;
            push(@line, ( $data->{$date}{$key}{$header} || "-" ) . "\t");
            #print STDERR ( $data->{$date}{$key}{$header} || "-" ) . "\t";
        }
    }
    print STDERR "<br>";    # end of the table row
    $column = 0;
    push(@output, \@line);
}
my $x = 1;
return @output;
}
This code works but it's a little ugly. Please let me know if there is a cleaner/better way to do this.

How to search through array elements for match in hash keys

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);
Whereby the unique IDs are contained within #scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach (@scaffoldnames) {
    s/[^0-9]*//g;
}    # remove all non-numerics

my @genes = read_file('splice.txt');    # splice.txt is a FASTA file

my $hash_index = '';
my $hash_seq = '';
foreach (@genes) {
    if (/^>/) {
        my $head = $_;
        $hash_index .= $head;    # collect all headers for the hash
    }
    else {
        my $sequence = $_;
        $hash_seq .= $sequence;    # collect all sequences for the hash
    }
}
my @hash_index = split(/\n/, $hash_index);    # element[0]=head1, element[1]=head2
my @hash_seq   = split(/\n/, $hash_seq);      # element[0]=seq1,  element[1]=seq2

my %hash;    # make a hash from both arrays - headers as keys, seqs as values
@hash{@hash_index} = @hash_seq;

foreach $hash_index (sort keys %hash) {
    for ($i=0; $i <= $#scaffoldnames; $i++) {
        if ($hash_index =~ /$scaffoldnames[$i]$/) {
            print GENE_ID "$hash_index\n$hash{$hash_index}\n";
        }
    }
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can take isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
    $value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;

my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);

my %hash = map {
    my ($key) = $hash_index[$_] =~ /(\d+)\z/;
    $key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;

open my $gene_fh, '>', 'gene_id.txt' or die $!;

for my $name (@scaffoldnames) {
    next unless my $info = $hash{$name};
    printf $gene_fh "%s\n%s\n", @$info;
}

close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
    s/\D+//g;
}    # remove all non-numerics

open my $splice_fh, '<', 'splice.txt' or die $!;    # splice.txt is a FASTA file

my %sequences;
my $id;
while (<$splice_fh>) {
    ($id) = /(\d+)$/ if /^>/;
    $sequences{$id} .= $_ if $id;
}

for my $id (@scaffoldnames) {
    if (my $sequence = $sequences{$id}) {
        print GENE_ID $sequence;
    }
}
