I have successfully created a hash of arrays, and I am using it to calculate log-odds scores for each DNA sequence from a file (Creating a hash of arrays for DNA sequences, Perl has input file format). I get a score for each sequence, but I also get a warning for each calculation, which I naturally want to clear up. The warning is: Use of uninitialized value in string eq at line 148.
Here is a summarized version of the code (I can post the full code if necessary):
use strict;
use warnings;
use Data::Dumper;
#USER SPECIFICATIONS
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number, used for calculating tables that have been omitted for space
$loscore{$id} .= 0; #Creates a hash with each id number to have a log-odds score
}
}
close FILE;
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Default width is 3 (arbitrary number chosen)
if ($width eq '')
{
$width = 3;
}
#Omitting code about $width<=0, creation of log-odds score hash to save space
foreach $id (keys %HoA, %loscore)
{
for my $pos (0..($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos] eq $base) #ERROR OCCURS HERE
{
$loscore{$id} += $logodds{$base}[$pos];
}
elsif ( ! defined $HoA{$id}[$pos])
{
print "$pos\n";
}
}
}
}
print Dumper(\%loscore);
The output I get is:
Use of uninitialized value in string eq at line 148, <STDIN> line 2.
2
(This warning repeats 4 times for each position - most likely once for each $base?)
$VAR1 = {
'Sequence 15' => '-1.27764697876093',
'Sequence 4' => '0.437512962981119',
(continues for 29 sequences)
}
To summarize, I want to calculate the log-odds score of each sequence. I have a log-odds table hash %logodds that contains the score of each base at each position within a motif. The log-odds score of a sequence is calculated by summing the referenced values. For example, if the log-odds table were
A 4 3 2
C 7 2 1
G 6 9 2
T 1 0 3
The log-odds score of the sequence CAG would be 7+3+2=12.
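For concreteness, here is a minimal self-contained sketch of that calculation, assuming %logodds is a hash of arrays keyed by base (matching the $logodds{$base}[$pos] lookups in the code above):
use strict;
use warnings;

# The table above as a hash of arrays, keyed by base.
my %logodds = (
    A => [ 4, 3, 2 ],
    C => [ 7, 2, 1 ],
    G => [ 6, 9, 2 ],
    T => [ 1, 0, 3 ],
);

my @seq   = split //, 'CAG';
my $score = 0;
$score += $logodds{ $seq[$_] }[$_] for 0 .. $#seq;    # C=7, A=3, G=2
print "$score\n";                                     # prints 12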
At the moment, I believe the error occurs because of the way I split the strings of DNA before putting them into the hash of arrays. As I previously stated, if you want all the code so you can copy-paste it, I can provide it. I think the solution is pretty simple, and I just need someone to point me in the right direction. Any and all help is appreciated, and I can clarify as questions arise. Also, any tips that could help me post more concise questions are appreciated (I know this one is lengthy; I just want to provide enough background information).
Here is the code that I am using to iterate through %HoA. It calculates a log-odds score for each sequence, then works through each sequence to find a maximum score. Big thanks to everyone for helping out!
foreach $id (keys %HoA)
{
for my $pos1 (0..$#{ $HoA{$id} })
{
for my $pos2 ($pos1..$pos1+($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base)
{
for my $pos3 (0..$width-1)
{
$loscore{$id} += $logodds{$base}[$pos3];
if ($loscore{$id} > $maxscore{$id})
{
$maxscore{$id} = $loscore{$id};
}
}
}
elsif ( ! defined $HoA{$id}[$pos2])
{
print "$pos2\n";
}
}
}
}
}
I have been struggling with this for a while in a Perl script I have. It's probably a slam dunk for you Perl experts, and it probably should be easier, but I can't quite crack the nut on this. I might need to split something here, I'm not sure.
My array code is as follows.
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ){
push @query_output, $row;
push (@{portfo->{label}},$row->{data},$row->{label});
}
And then my print of the array is as follows:
print "array here--";
print "[";
foreach (@{portfo->{label}}) {
#(@{portfo->{label}},$row->{data});
print "\{\"data\":";
print "".$_.",";
print "\"label\":";
print "\"".$row[1]."\"\},";
}
}
print "]";
print "\n";
And then my output looks like this:
[{"data":2943,"label":""},{"data":CDI3,"label":""},
{"data":1,"label":""},{"data":COS-COS2,"label":""},
{"data":1087,"label":""},{"data":COS1,"label":""},
{"data":5183,"label":""},{"data":COS2,"label":""},
{"data":2731,"label":""},{"data":CULB,"label":""},{"data":1,"label":""},
{"data":EQUIT,"label":""},{"data":4474,"label":""},
{"data":Network,"label":""},]
I am trying to put the alphanumeric string array items like CDI3, COS1, COS2, etc. in quotes, while leaving the numeric values unquoted with the "data" name pair. The desired output is:
[{"data":2943,"label":""},{"data":"CDI3","label":""},
{"data":1,"label":""},{"data":"COS-COS2","label":""},
{"data":1087,"label":""},{"data":"COS1","label":""},
{"data":5183,"label":""},{"data":"COS2","label":""},
{"data":2731,"label":""},{"data":"CULB","label":""},{"data":1,"label":""},
{"data":"EQUIT","label":""},{"data":4474,"label":""},
{"data":"Network","label":""}]
I'm sure it's a simpler fix that I'm making it but so far no luck. Any tips would be greatly appreciated!
Thanks!
use JSON::XS qw( encode_json );
my @data;
while ( my $row = $query_handle->fetchrow_hashref('NAME_lc') ) {
# If $row->{data} is a number,
# make sure it's stored as a number
# so that it gets serialized as a number.
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
push @data, $row;
}
print(encode_json(\@data));
Or
my $data = $query_handle->fetchall_arrayref({ data => 1, label => 1 });
for my $row (@$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
Or if you ensure the field names are returned as lowercase[1],
my $data = $query_handle->fetchall_arrayref({});
for my $row (@$data) {
$row->{data} += 0 if $row->{data} =~ /^\d+\z/;
}
print(encode_json($data));
[1] This can be done using $dbh->{FetchHashKeyName} = 'NAME_lc'; or with AS `label` in the query.
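For what it's worth, the reason the += 0 trick works: JSON::XS serializes a scalar as a JSON number only if the scalar currently holds a numeric value rather than a string. A tiny demonstration with a made-up value:
use strict;
use warnings;
use JSON::XS qw( encode_json );

my $value = "2943";                         # DBI typically returns values as strings
print encode_json( [ $value ] ), "\n";      # ["2943"] -- serialized as a JSON string
print encode_json( [ $value + 0 ] ), "\n";  # [2943]   -- numified, serialized as a number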
I've looked for a solution to this here on Stack Overflow and elsewhere, but can't find examples that deal with the volume I have to work with. If I have missed a solution that has been posted elsewhere, I'd be very grateful if someone could point me in the right direction.
I'm trying to import time series data from 45 different Excel worksheets (about 5 per Excel workbook). Each worksheet contains commodity price series, covering several years of daily prices for each commodity.
The raw Excel data has one row for each day for which prices might exist and one column for each commodity contract, which is typically a monthly futures contract. Each contract hence has at least 30 data points (but not many more), while the entire table has several thousand rows and 100+ columns.
I was able to build a SSIS package that reads the data and using unpivot, transforms the matrix into row based records with columns for:
Date, Price, Contract
The problem, however, is that in the unpivot transform I have to manually specify the destination column for each transformed input column. With 45 worksheets, each containing 100+ contract columns (some even several hundred), I'd end up 'hard-coding' those transforms manually for the next few days... On top of that, this is not as flexible/re-usable as I was hoping.
Example of the raw data attached (the entire Cocoa worksheet contains 9724 rows and 195 columns)
Here's how the unpivot for another single commodity is configured. The 'Destination Column' has to be filled in manually row-by-row.
I'm hoping that I have just missed the right steps in the unpivot configuration to make these columns dynamic. Ideally the SSIS solution could be re-used later with identically formatted Excel workbooks. It's not required to run this on the server, as it's not a frequently recurring task but rather a once- or twice-per-year thing at most, so I can easily kick it off manually from within VS.
I'm trying to help an academic researcher who would otherwise spend a massive amount of time cleaning and analysing the data manually in Excel.
First you need to design and create the tables that are going to receive the data and experiment with some manual data entry to check the data model.
Make sure each spreadsheet has enough header information to know how to process the rows.
When that is done, I would save the sheets to text files with tab delimiters.
Next I would write a loading program in Perl. It reads the header rows first and determines the rules for inserting the rows into the database. Then each row gets converted into an insert into the database.
Here is an example from an invoice loading program I own (all rights):
if ($first) {
$obj->_hdr2keys(0); # convert spreadsheet header into a lookup
my $hdr = $obj->_copystruct($obj->{ar}[0]);
my @Hhdr = ('invoice header id');
my @Hcols = ('invhid');
my @Htypes = ('serial');
my @Dhdr = ('invoice detail id');
my @Dcols = ('invdid','invhid');
my @Dtypes = ('serial','integer');
for (my $col=0; $col <= $#{$hdr}; $col++) {
my $colname = lc($obj->_pomp($hdr->[$col]));
if ($colname eq 'invoicenumber') {
push @Hhdr, $hdr->[$col];
push @Hcols, $colname;
push @Htypes, 'char(32)';
}
elsif ($colname eq 'buysell') {
push @Hhdr, $hdr->[$col];
push @Hcols, $colname;
push @Htypes, 'boolean';
}
elsif ($colname eq 'suppliercustomer') {
push @Hhdr, $hdr->[$col];
push @Hcols, $colname;
push @Htypes, 'char(64)';
}
elsif ($colname eq 'date') {
push @Hhdr, 'Transaction Date';
push @Hcols, 'transactiondate';
push @Htypes, 'date';
}
elsif ($colname eq 'article') {
push @Dhdr, 'Article id';
push @Dcols, 'artid';
push @Dtypes, 'integer';
push @Dhdr, 'Article Description';
push @Dcols, 'description';
push @Dtypes, 'char(64)';
}
elsif ($colname eq 'qty') {
push @Dhdr, $hdr->[$col];
push @Dcols, $colname;
push @Dtypes, 'integer';
}
elsif ($colname eq 'priceexclbtw') {
push @Dhdr, $hdr->[$col];
push @Dcols, $colname;
push @Dtypes, 'double precision';
}
elsif ($colname eq 'btw') {
push @Dhdr, $hdr->[$col];
push @Dcols, $colname;
push @Dtypes, 'real';
}
}
$obj->_getset('INVHar',
['invoiceheader',
['PK','invhid'],
['__COLUMNS__'],
\@Hcols,
\@Htypes,
\@Hhdr
]
);
$obj->_getset('INVDar',
['invoicedetail',
['PK','invdid'],
['FK','invhid','invoiceheader','invhid'],
['FK','artid','article','artid'],
['__COLUMNS__'],
\@Dcols,
\@Dtypes,
\@Dhdr
]
);
}
$first = 0;
SALESROW: for (my $i=1; $i <= $#{$obj->{ar}}; $i++) {
my @Hrow = ('');
my @Drow = ('');
my $date = $obj->_selectar('', $i, 'Date');
$date =~ s/\-/\//g;
if ($date) {
$obj->_validCSV('date', $date)
or die "CSV format error date |$date| in file $file";
}
my $invtotal = ($obj->_selectar('', $i, 'Invoice Total incl. BTW'));
my $article = $obj->_selectar('', $i, 'Article');
$date or $article or next SALESROW;
if ($date) {
push @Hrow, $obj->_selectar('', $i, 'Invoice Number');
my $buysell = $obj->_selectar('', $i, 'Buy/Sell');
push @Hrow, ($buysell eq 'S') ? 1 : 0;
push @Hrow, $obj->_selectar('', $i, 'Supplier/Customer');
push @Hrow, $date;
push @{$obj->_getset('INVHar')}, \@Hrow;
$invhid++;
}
push @Drow, $invhid;
if ($article eq 'E0154') {
push @Drow, 1;
}
elsif ($article eq 'C0154') {
push @Drow, 2;
}
elsif ($article eq 'C0500') {
push @Drow, 3;
}
elsif ($article eq 'C2000') {
push @Drow, 4;
}
elsif ($article eq 'C5000') {
push @Drow, 5;
}
else {
die "unrecognised sales article $article\n"
. Dumper($obj->{ar}[$i]);
}
push @Drow, undef; # description is in article table
push @Drow, $obj->_selectar('', $i, 'Qty.');
push @Drow, $obj->_selectar('', $i, 'Price excl. BTW');
push @Drow, $obj->_selectar('', $i, 'BTW %');
push @{$obj->_getset('INVDar')}, \@Drow;
}
This creates header and detail records for invoices after the product table has already been loaded from another spreadsheet.
In the above example two arrays of arrays are created, INVHar and INVDar. When they are ready, the calling routine loads them into the database as follows. In this next code example the tables are created as well as the rows, and a metadb is also updated for loading future tables and managing foreign keys for existing tables. The array created in the previous snippet contains all the information needed to create the table and insert the rows. There is also a simple routine _DBdatacnv that converts between the formats in the spreadsheet and the formats needed in the database; for example, the spreadsheet had currency symbols that need to be stripped before insertion.
sub _arr2db {
my ($obj) = @_;
my $ar = $obj->_copystruct($obj->_getset('ar'));
my $dbh = $obj->_getset('CDBh');
my $mdbh = $obj->_getset('MDBh');
my $table = shift @$ar;
$mdbh->{AutoCommit} = 0;
$dbh->{AutoCommit} = 0;
my @tables = $mdbh->selectrow_array(
"SELECT id FROM mtables
WHERE name = \'$table\'"
);
my $id = $tables[0] || '';
if ($id) {
$mdbh->do("DELETE FROM mcolumns where tblid=$id");
$mdbh->do("DELETE FROM mtables where id=$id");
}
# process constraints
my %constraint;
while ($#{$ar} >= 0
and $ar->[0][0] ne '__COLUMNS__') {
my $cts = shift @$ar;
my $type = shift @$cts;
if ($type eq 'PK') {
my $pk = shift @$cts;
$constraint{$pk} ||= '';
$constraint{$pk} .= ' PRIMARY KEY';
@$cts and die "unsupported compound key for $table";
}
elsif ($type eq 'FK') {
my ($col, $ft, $fk) = @$cts;
$ft && $fk or die "incomplete FK declaration in CSV for $table";
$constraint{$col} ||= '';
$constraint{$col} .=
sprintf( ' REFERENCES %s(%s)', $ft, $fk );
}
elsif ($type eq 'UNIQUE') {
while (my $uk = shift @$cts) {
$constraint{$uk} ||= '';
$constraint{$uk} .= ' UNIQUE';
}
}
elsif ($type eq 'NOT NULL') {
while (my $nk = shift @$cts) {
$constraint{$nk} ||= '';
$constraint{$nk} .= ' NOT NULL';
}
}
else {
die "unrecognised constraint |$type| for table $table";
}
}
shift @$ar;
unless ($mdbh->do("INSERT INTO mtables (name) values (\'$table\')")) {
warn $mdbh->errstr . ": mtables";
$mdbh->rollback;
die;
}
@tables = $mdbh->selectrow_array(
"SELECT id FROM mtables
WHERE name = \'$table\'"
);
$id = shift @tables;
$dbh->do("DROP TABLE IF EXISTS $table CASCADE")
or die $dbh->errstr;
my $create = "CREATE TABLE $table\n";
my $cols = shift @$ar;
my $types = shift @$ar;
my $desc = shift @$ar;
my $first = 1;
my $last = 0;
for (my $i=0; $i<=$#{$cols}; $i++) {
$last = 1;
if ($first) {
$first = 0;
$create .= "( "
}
else {
$create .= ",\n";
}
$create .= $cols->[$i]
. ' ' . $obj->_DBcnvtype($types->[$i]);
$constraint{$cols->[$i]}
and $create .= ' ' . $constraint{$cols->[$i]};
unless ($mdbh->do("INSERT INTO mcolumns (tblid,name,type,description)
values ($id,\'$cols->[$i]\',\'$types->[$i]\',\'$desc->[$i]\')"))
{
warn $mdbh->errstr;
$mdbh->rollback;
die;
}
}
$last and $create .= ')';
unless ($dbh->do($create)) {
warn $dbh->errstr;
$dbh->rollback;
die;
}
my $count = 0;
while (my $row = shift @$ar) {
$count++;
my $insert = "INSERT INTO $table (";
my $values = 'VALUES (';
my $first = 1;
for (my $i=0; $i<=$#{$cols}; $i++) {
my $colname = $cols->[$i];
unless (defined($constraint{$colname})
and $constraint{$colname} =~ /PRIMARY KEY/) {
if ($first) {
$first = 0;
}
else {
$insert .= ', ';
$values .= ', ';
}
$insert .= $colname;
my $val = $obj->_DBdatacnv('CSV', 'DB',
$types->[$i],$row->[$i]);
if ($val eq '%ABORT') {
$mdbh->rollback;
die;
}
$values .= $val;
}
}
$insert .= ')' . $values . ')';
unless ($dbh->do($insert)) {
warn $dbh->errstr;
warn $insert;
$mdbh->rollback;
die;
}
}
NOINSERT: $mdbh->commit;
$dbh->commit;
# warn "inserted $count rows into $table";
}
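_DBdatacnv itself isn't shown above. Purely as an illustration of the idea, not the author's actual routine, a minimal converter along those lines might look like this (the quoting and currency-stripping rules are assumptions):
sub _DBdatacnv {
    my ($obj, $from, $to, $type, $val) = @_;
    # Hypothetical sketch only; the real routine also signals '%ABORT' on bad data.
    return 'NULL' unless defined $val and $val ne '';
    if ($type =~ /^(?:integer|serial)$/) {
        $val =~ s/[^\d+-]//g;          # keep digits and sign only
        return $val;
    }
    if ($type =~ /^(?:real|double precision)$/) {
        $val =~ s/[^\d.+-]//g;         # strip currency symbols and separators
        return $val;
    }
    if ($type eq 'boolean') {
        return $val ? 'TRUE' : 'FALSE';
    }
    $val =~ s/'/''/g;                  # escape quotes for a SQL string literal
    return "'$val'";
}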
Update: OK, I'll add the generic routine that converts from CSV into an array ready for _arr2db above, for all the other cases I have in a system. The spreadsheets are first augmented with PK, FK and other constraint rows, then a __COLUMNS__ marker row, followed by a header of column names for the database, a row of the notional database types (the actual types are taken care of in _DBcnvtype), then a row of descriptions that go in the metadatabase, and finally the rows of data to insert.
sub _csv2arr {
my ($obj, $csv ) = @_;
my $ar = [];
my $delim = $obj->_getset('csvdelim') || '\,';
my $basename = basename($csv);
$basename =~ s/\.txt$//;
$ar = [$basename];
open my $fh, $csv
or die "$!: $csv";
while (<$fh>) {
chomp;
my $sa = [];
@$sa = split /$delim/;
push @$ar, $sa;
}
close $fh;
$obj->{ar} = $ar;
}
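For illustration only, here is a small hypothetical input file in the augmented layout just described (assuming csvdelim is set to a tab, per the earlier advice to save sheets tab-delimited; the table name comes from the file's basename, e.g. invoiceheader.txt, and the empty leading field leaves the serial PK column to the database):
PK	invhid
__COLUMNS__
invhid	invoicenumber	buysell	transactiondate
serial	char(32)	boolean	date
invoice header id	Invoice Number	Buy/Sell	Transaction Date
	INV-001	1	2014/01/15
	INV-002	0	2014/01/16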
I would do this as a series of nested loops:
Loop 1: Iterate through all files in the folder; pass the file name to the next loop.
Loop 2: Open the file; iterate through its sheets.
Loop 3: In sheet X, loop through the columns A > 1.
Loop 4: Loop through the rows:
Read sheet X, row B.
Get the values from (row B, column 1) as Date, (row 1, column A) as Product and (row B, column A) as Price; write them to the destination.
End Loop 4
(optional: at the end of the column, record some metadata about the number of rows)
End Loop 3
(optional: record some metadata about the number of columns in the sheet)
End Loop 2
(optional: record some metadata about the number of sheets in the file)
End Loop 1
(strongly suggested: record some metadata about file X and the number of sheets/rows/columns - you can test a sample later for your own confidence)
You may want to modify a copy of one of your files so that you can test for issues such as:
sheet empty
invalid data
missing header
text instead of price
This will give more confidence and will shorten the rework that is required when you discover new edge cases.
Final output table should be the denormalised data in 3 columns:
Date,
Product,
Price
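The answer above is pure SSIS; since the rest of this thread is Perl-centric, here is a rough sketch of the same loop structure in Perl using Spreadsheet::ParseXLSX. The module choice, file glob and tab-separated output are my assumptions, not part of the original answer; treat it as an outline rather than a drop-in solution.
use strict;
use warnings;
use Spreadsheet::ParseXLSX;

my $parser = Spreadsheet::ParseXLSX->new;
for my $file ( glob '*.xlsx' ) {                        # Loop 1: each workbook
    my $workbook = $parser->parse($file)
        or die "Cannot parse $file";
    for my $sheet ( $workbook->worksheets ) {           # Loop 2: each sheet
        my ( $row_min, $row_max ) = $sheet->row_range;
        my ( $col_min, $col_max ) = $sheet->col_range;
        for my $col ( $col_min + 1 .. $col_max ) {      # Loop 3: each contract column
            my $product = $sheet->get_cell( $row_min, $col );
            next unless $product;
            for my $row ( $row_min + 1 .. $row_max ) {  # Loop 4: each date row
                my $date  = $sheet->get_cell( $row, $col_min );
                my $price = $sheet->get_cell( $row, $col );
                next unless $date && $price && length $price->value;
                # One denormalised record: Date, Product, Price
                print join( "\t", $date->value, $product->value, $price->value ), "\n";
            }
        }
    }
}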
EDIT
Here is a link that shows how you can dynamically loop through the columns of an Excel spreadsheet (and sheets) so that you can use this process to unpivot the data to a normalised form
Looping through Excel columns in SSIS
I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
The unique IDs are contained within @scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach(@scaffoldnames) {
s/[^0-9]*//g;
} #Remove all non-numerics
my @genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach(@genes){
if (/^>/){
my $head = $_;
$hash_index .= $head; #Collect all heads for hash
}
else {
my $sequence = $_;
$hash_seq .= $sequence; #Collect all sequences for hash
}
}
my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]$/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
I'm trying to isolate all differentially expressed genes (by unique ID) as output by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as the key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);
my %hash = map {
my ($key) = $hash_index[$_] =~ /(\d+)\z/;
$key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (@scaffoldnames) {
next unless my $info = $hash{$name};
printf $gene_fh "%s\n%s\n", @$info;
}
close $gene_fh;
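For instance, once %hash is built, looking up a single ID (using the sample ID from the question) needs no loop at all:
# Direct lookup by ID -- no loop over the hash required.
if ( my $info = $hash{14272} ) {
    my ( $header, $sequence ) = @$info;
    print "$header\n$sequence\n";
}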
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
($id) = /(\d+)$/ if /^>/;
$sequences{$id} .= $_ if $id;
}
for my $id (#scaffoldnames) {
if (my $sequence = $sequences{$id}) {
print GENE_ID $sequence;
}
}
I have a text file laid out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have Python (if you're on Linux, you probably already have it), I'd use a short Python script to do this. Note that sets are used to filter out "double" items.
Edited to be closer to the requester's requirements:
import csv
from decimal import *
getcontext().prec = 7
csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
value = row[0]
for key in row[1:]:
try:
maindict[key].add(Decimal(value))
except KeyError:
maindict[key] = set()
maindict[key].add(Decimal(value))
csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
csv_writer.writerow([key] + sorted(maindict[key]))
I would try Perl if it's available to you. Loop through the input a row at a time. Split the line on the tab, then the right-hand part on the commas. Shove the values into an associative array with the letters as keys and another associative array as each value; the second associative array plays the part of a set, so as to eliminate duplicates.
Once you have read the input file, sort the keys of the associative array, loop through them, and spit out the results.
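A minimal sketch of that approach (hash of hashes used as a set; the input file name is a placeholder):
use strict;
use warnings;

my %set;    # $set{letter}{number} acts as a set, eliminating duplicates
open my $fh, '<', 'input.txt' or die $!;
while (<$fh>) {
    chomp;
    my ( $num, $csv ) = split /\t/;
    $set{$_}{$num} = 1 for split /\s*,\s*/, $csv;
}
close $fh;
for my $letter ( sort keys %set ) {
    print $letter, "\t",
        join( ', ', sort { $a <=> $b } keys %{ $set{$letter} } ), "\n";
}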
Here's a small utility in PHP:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
Not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my @csv = split m"\s*,\s*", $csv;
push @{$data{$_}}, $number for @csv;
}
for my $number (sort keys %data){
# dedupe by using the values as keys of an anonymous hash, then sort
my @unique = sort keys %{{ map { ($_,undef) } @{$data{$number}} }};
say $number, "\t", join ', ', @unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my @values;
if($csv->parse($rest)) {
@values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(@values) {
push @{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } #{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
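If the duplicates do matter, one small tweak (my addition, not part of the original answer) is to track the pairs already seen while reading, e.g. replacing the inner foreach:
my %seen;    # declared alongside %hash
foreach (@values) {
    push @{ $hash{$_} }, $first unless $seen{$_}{$first}++;
}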
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them all together and you should get your desired output. I tested with the GNU tools.
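For reference, the three stages assembled into a single pipeline (input.txt is a placeholder file name):
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }' input.txt |
sort -t $'\t' -u -k 1f,1 -k 2n |
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'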