Eliminating uninitialized values in my Perl hash of arrays

I successfully create a hash of arrays, and I am using it to calculate log-odds scores for each DNA sequence from a file (Creating a hash of arrays for DNA sequences, Perl has input file format). I get a score for each sequence, but I get a warning for each calculation. Naturally, I want to clear up the warning. The warning is: Use of uninitialized value in string eq at line 148.
Here is a summarized version of the code (I can post the full code if necessary):
use strict;
use warnings;
use Data::Dumper;
#USER SPECIFICATIONS
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number, used for calculating tables that have been omitted for space
$loscore{$id} .= 0; #Creates a hash with each id number to have a log-odds score
}
}
close FILE;
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Default width is 3 (arbitrary number chosen)
if ($width eq '')
{
$width = 3;
}
#Omitting code about $width<=0, creation of log-odds score hash to save space
foreach $id (keys %HoA, %loscore)
{
for my $pos (0..($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos] eq $base) #ERROR OCCURS HERE
{
$loscore{$id} += $logodds{$base}[$pos];
}
elsif ( ! defined $HoA{$id}[$pos])
{
print "$pos\n";
}
}
}
}
print Dumper(\%loscore);
The output I get is:
Use of uninitialized value in string eq at line 148, <STDIN> line 2.
2
(This error repeats 4 times for each position - most likely due to matching to each $base?)
$VAR1 = {
'Sequence 15' => '-1.27764697876093',
'Sequence 4' => '0.437512962981119',
(continues for 29 sequences)
}
To summarize, I want to calculate the log-odds score of each sequence. I have a log-odds table %logodds that contains the score of a base at each location within a motif, and a hash %loscore that holds the total score for each sequence. The log-odds score is calculated by summing the referenced values. For example, if the log-odds table was
A 4 3 2
C 7 2 1
G 6 9 2
T 1 0 3
The log-odds score of the sequence CAG would be 7+3+2=12.
At the moment, I believe that the error occurs because of the way I split the strings of DNA to be put into the hash of arrays. As I previously stated, if you want all the code so you can copy-paste, I can provide it. I think the solution is pretty simple, and I just need someone to point me in the right direction. Any and all help is appreciated, and I can clarify as questions arise. Also, any tips that could help me to post more concise questions are appreciated (I know this one is lengthy, I just want to provide enough background information).
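The worked example above (CAG scoring 7+3+2=12) can be sketched as follows. This is a minimal illustration, not the asker's full script: `score_motif` is a hypothetical helper name, and the table is the small example table from the question. It skips positions that are missing or are not A, C, G or T, which is one way to keep an undefined value from reaching `eq` or `+=`.

```perl
use strict;
use warnings;

# Log-odds table from the example: one row per base, one column per motif position.
my %logodds = (
    A => [4, 3, 2],
    C => [7, 2, 1],
    G => [6, 9, 2],
    T => [1, 0, 3],
);

# score_motif is a hypothetical helper, not part of the original script.
# It sums the table entries for the first $width bases of $seq and skips
# positions that are missing or are not A, C, G or T.
sub score_motif {
    my ($seq, $width, $table) = @_;
    my @bases = split //, uc $seq;
    my $score = 0;
    for my $pos (0 .. $width - 1) {
        my $base = $bases[$pos];
        next unless defined $base && exists $table->{$base};
        $score += $table->{$base}[$pos];
    }
    return $score;
}

print score_motif('CAG', 3, \%logodds), "\n";    # prints 12
```

Looking the base up directly in the table also removes the inner loop over `qw(A C G T)`, so each position is tested once instead of four times.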

Here is the code that I am using to iterate through %HoA. It calculates a log-odds score for each sequence, then works through each sequence to find its maximum score. Big thanks to everyone for helping out!
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-1)
{
for my $pos2 ($pos1..$pos1+($width-1))
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base)
{
for my $pos3 (0..$width-1)
{
$loscore{$id} += $logodds{$base}[$pos3];
if ($loscore{$id} > $maxscore{$id})
{
$maxscore{$id} = $loscore{$id};
}
}
}
elsif ( ! defined $HoA{$id}[$pos2])
{
print "$pos2\n";
}
}
}
}
}
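A minimal sketch of the same sliding-window idea, under stated assumptions: `best_window_score` is a hypothetical helper, and the table is the example table from the question rather than the real %logodds. Note that `length` applied to an array reference measures the stringified reference, so the sketch uses the element count of the array instead and stops at the last window that fits completely inside the sequence.

```perl
use strict;
use warnings;

# Same example table as in the question; in the real script %logodds is built earlier.
my %logodds = (
    A => [4, 3, 2],
    C => [7, 2, 1],
    G => [6, 9, 2],
    T => [1, 0, 3],
);

# best_window_score is a hypothetical helper: it scores every window of
# $width bases and returns the highest score, or undef if the sequence is
# shorter than the motif.
sub best_window_score {
    my ($bases, $width, $table) = @_;
    my $best;
    # The last valid start is the element count minus the width, so the
    # window never reads past the end of the array.
    for my $start (0 .. @$bases - $width) {
        my $score = 0;
        for my $offset (0 .. $width - 1) {
            my $base = $bases->[ $start + $offset ];
            next unless exists $table->{$base};    # skip N or other symbols
            $score += $table->{$base}[$offset];
        }
        $best = $score if !defined $best || $score > $best;
    }
    return $best;
}

my %HoA = ( 'Sequence 1' => [ split //, 'TCAGT' ] );
for my $id (sort keys %HoA) {
    print "$id: ", best_window_score($HoA{$id}, 3, \%logodds), "\n";
}
```

Each window starts from a fresh score here; accumulating into one running total per sequence would mix the scores of different windows.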

Related

Incorporate year matching sub-routine into script and apply condition to result

I have a script which reads a csv file line by line, and compares the title in field 2 to the titles in another csv file. If 5 or more words match, then it prints out the line of each file which matches this criterion. Here is the script:
#!/bin/perl
#subroutine for discovering year
sub find_year {
my( $str ) = @_;
my $year = $1 if( $str =~ /\b((?:19|20)\d\d)\b/ );
return $year
}
#####CREATE CSV2 DATA
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
my @csv2years;
for ( @csv2 ) {
chomp;
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/define the data which is the title
$csv2hash{$_} = $title; # Indicate that title data will input into csv2hash.
}
###### CREATE CSV1 DATA
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp; #removes new lines
my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ creates variable of title
my %words;
$words{$_}++ for split /\s+/, $title; #/ get words
## Collect unique words into an array- the @ means an array
my @titlewords = keys(%words);
# Add exception words which shouldn't be matched.
my @new;
foreach my $t (@titlewords){
push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
}
###### The comparison algorithm
@titlewords = @new;
my $desired = 5; # Desired matching number of words
my $matched = 0;
foreach my $csv2 (keys %csv2hash) {
my $count = 0;
my $value = $csv2hash{$csv2};
foreach my $word (@titlewords) {
my @matches = ( $value=~/\b$word\b/ig );
my $numIncsv2 = scalar(@matches);
@matches = ( $title=~/\b$word\b/ig );
my $numIncsv1 = scalar(@matches);
++$count if $value =~ /\b$word\b/i;
if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
$count = $desired+1;
last;
}
}
if ($count >= $desired) {
print "$csv2\n";
++$matched;
}
}
print "$_\n\n" if $matched;
}
As you can see, I've created a find_year subroutine which can be used to discover whether the title contains a year in the 20th or 21st century (19xx or 20xx). A few days ago I asked a question about assigning a result to a set of conditions that involve matching a year, and Borodin provided a great answer here:
Perl- What function am I looking for? Assigning multiple rules to a specified outcome
I want the same conditions to apply now, only this time the script will be comparing years in the titles of the CSV files rather than standard input and a data list (as in the previous question).
What I now want to do is include this logic as a function in my word-matching script: if the conditions from my previous question give a Pass, perform the word-matching part of the script (i.e. 5 words match). If they give a Fail, skip comparing the lines and move on to the next one (i.e. don't bother with the 5-matching-words part of the script). The Pass and Fail results don't have to be printed out; I am just using these words to describe the rules of the year comparison in my previous question.
example for csv1:
14564564,1987 the door to the other doors,546456,47878787
456456445,Mullholland Drive is the bets film ever 1959,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545
454654564, 1939 hello good world you are great ,45456456, 54564654
example for csv2:
154465454,the other door was the door to 1949,546456,478787870
2156485754,Mullholland Drive is the bets film ever 1939,45454545,45454545
87894454,Twin Peaks forget that stuff 1984,45454545,45454545
2145678787, 1939 good lord you are great ,787425454,878777874
Current result before year_match subroutine is incorporated:
2156485754,Mullholland Drive is the best film ever 1939,45454545,45454545
456456445,Mullholland Drive is the best film ever 1959,45454545,45454545
87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545
2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654
Desired result after match_year subroutine is incorporated:
87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545
2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654
I can get my head around Borodin's answer to my previous question, but as this script I'm working on is difficult to read (IMO noob opinion anyway!), I'm having trouble working out how I can incorporate this new function into it.
I reviewed the algorithm and replaced the repeated loops over csv2 with a hash of words, where each word maps to the csv2 row numbers that contain it. A separate preliminary check of the years is no longer required.
#!/usr/bin/perl
#use Data::Dumper;
#####CREATE CSV2 DATA
open CSV2, "<csv2" or die;
my @csv2=<CSV2>;
close CSV2;
my %words2; # $words2{lower_case_word}->{csv2_row_number}->word_count
my $i=0; # csv2 row number
my %c2year; # Years of csv2 row numbers
for(@csv2) {
chomp;
for((split /,\s*/)[1]=~/(\w+)/g) { # list words in title
$words2{lc($_)}{$i}++;
$c2year{$i}=$_ if(/^(19|20)\d\d$/);
}
$i++;
}
#print Dumper(\%words2);
###### READ CSV1 DATA
my $desired = 5; # Desired matching number of words
open CSV1, "<csv1" or die;
while (<CSV1>) {
chomp; #removes new lines
my %rows=(); # $rows{csv2_row_number} => number_of_matched_words_in_row
my $matched = 0;
my ($title) = (split /,\s*/)[1]; #/ creates variable of title
my %words;
my $year=0;
####### get words and filter it
$words{lc($_)}++ for
grep {
$year=$_ if(/^(19|20)\d\d$/); # Years present in word list
!/^(rare|vol|volume|issue|double|magazine|mag)$/i
} $title=~/(\w+)/g; #/
###### The comparison algorithm
for(keys(%words)) {
# my $word=$_; # <-- if need count words
if($words2{$_}) {
for(keys(%{$words2{$_}})) {
$rows{$_}++; # <-- OR $rows{$_}+=$words{$word} OR/AND +=$words2{$word}{$_}
}
}
}
# print Dumper(\%rows);
for(keys(%rows)) {
if ( ($rows{$_} >= $desired)
&& (!$year || !$c2year{$_} || $year==$c2year{$_} )
) {
print "$year<=>$c2year{$_} csv2: ",$csv2[$_],"\n";
++$matched;
}
}
print "csv1: $_\n\n" if $matched;
}
Uncomment use Data::Dumper and the print Dumper(...) lines to inspect the hashes.
If you need to take repeated words into account, then:
###### The comparison algorithm
for(keys(%words)) {
my $W=$_;
if($words2{$_}) {
for(keys(%{$words2{$_}})) {
$rows{$_} += $words{$W} < $words2{$W}{$_} ? $words{$W} : $words2{$W}{$_};
# $words{$W} - same word count in csv1, $words2{$W}{$_} - count in csv2
}
}
}
# print Dumper(\%rows);
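The year condition used in the comparison above (pass when either title has no year, or when both years are equal) can be isolated into a small function. This is a sketch under stated assumptions: `years_compatible` is a hypothetical name, and `find_year` follows the subroutine from the question, rewritten with a list assignment instead of `my ... if`.

```perl
use strict;
use warnings;

# find_year follows the subroutine in the question: it returns the first
# 19xx or 20xx year found in the string, or undef if there is none.
sub find_year {
    my ($str) = @_;
    my ($year) = $str =~ /\b((?:19|20)\d\d)\b/;
    return $year;
}

# years_compatible is a hypothetical name for the rule used in the answer:
# two titles pass when either has no year, or when both years are equal.
sub years_compatible {
    my ($title1, $title2) = @_;
    my $year1 = find_year($title1);
    my $year2 = find_year($title2);
    return 1 if !defined $year1 || !defined $year2;
    return $year1 == $year2 ? 1 : 0;
}

# Different years: the pair would be skipped before any word matching.
print years_compatible('Mullholland Drive 1959', 'Mullholland Drive 1939') ? "compare\n" : "skip\n";
# Only one title has a year: the pair goes on to word matching.
print years_compatible('The Twin Peaks forget that stuff', 'Twin Peaks forget that stuff 1984') ? "compare\n" : "skip\n";
```

Calling such a function before the word-matching loop would let a failing pair be skipped with `next`.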

Perl matching multidimensional array elements

I'm not getting any output. Can anyone see where the issue lies:
in the matching or in the calling?
(The two subarrays in the multidimensional array have the same length.)
# Multidimensional array:
# @idarray = FASTA IDs, @seqarray = sequences such as "ATTGTTGGT"
@ordarray = (\@idarray, \@seqarray);
# This access works
print $ordarray[0][0] , "\n";
print $ordarray[1][0] , "\n", "\n";
# @ordarray output = "TTGTGGCACATAATTTGTTTAATCCAGAT....."
User inputs a search string, loop iterates the sequence dimension,
and counts amount of matches. Prints number of matches and the corresponding ID from the ID dimension.
# The user's search string
$sestri = <>;
for($r=0;$r<@idarray;$r++) {
if ($sestri =~ $ordarray[1][$r] ){
print $ordarray[0][$r] , "\n";
$counts = () = $ordarray[0][$r] =~ /$sestri/g;
print "number of counts: ", $counts ;
}
}
I think the problem lies with this:
$sestri = <>;
That may well not be doing what you intended. Your comment says "user specified search string", but that's not quite what that operator does.
What it does is read a line from the file you specified on the command line (or from standard input if no file was named).
I would suggest that if you want to grab a search string from the command line, you do it via @ARGV.
E.g.
my ( $sestri ) = @ARGV; # will give the first argument.
However, please please please switch on use strict and use warnings. You should always do this prior to posting on a forum for assistance.
I would also question quite why you need a two dimensional array with two elements in it though. It seems unnecessary.
Why not instead make a hash, and key your "fasta ids" to the sequence?
E.g.
my %id_of;
@id_of{@seqarray} = @idarray;
my %seq_of;
@seq_of{@idarray} = @seqarray;
I think this would suit your code a bit better, because then you don't have to worry about the array indices at all.
use strict;
use warnings;
my ($sestri) = @ARGV;
my %id_of;
@id_of{@seqarray} = @idarray;
foreach my $sequence ( keys %id_of ) {
##NB - this is a pattern match, and will be 'true'
## if $sestri is a substring of $sequence
if ( $sequence =~ m/$sestri/ ) {
print $id_of{$sequence}, "\n";
my $count = () = $sequence =~ m/$sestri/g;
print "number of counts: ", $count, "\n";
}
}
I've rewritten it a bit, because I'm not entirely understanding what your code is doing. It looks like it's substring matching in @seqarray but then returning the count of matching elements in @idarray. I don't think that makes sense, but if it does, then amend according to your needs.
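If the two parallel arrays are kept, the matching step can be sketched like this. The data below is made up, `count_matches` is a hypothetical helper, and the sketch assumes the search string carries no trailing newline (a line read with `<>` or `<STDIN>` keeps its newline until it is chomped). The sequence goes on the left of `=~` and the search pattern on the right.

```perl
use strict;
use warnings;

# Made-up example data standing in for the FASTA IDs and sequences.
my @idarray  = ('seq1', 'seq2', 'seq3');
my @seqarray = ('ATTGTTGGT', 'CCCCCC', 'TTGATTG');

# count_matches is a hypothetical helper: it returns [id, count] pairs for
# every sequence that contains $search at least once.
sub count_matches {
    my ($search, $ids, $seqs) = @_;
    my @hits;
    for my $i (0 .. $#$ids) {
        # \Q...\E treats the search string literally rather than as a regex.
        my $count = () = $seqs->[$i] =~ /\Q$search\E/g;
        push @hits, [ $ids->[$i], $count ] if $count;
    }
    return @hits;
}

# In a real script this could be: my ($sestri) = @ARGV;
# or: chomp(my $sestri = <STDIN>);
my $sestri = 'TTG';
for my $hit (count_matches($sestri, \@idarray, \@seqarray)) {
    print "$hit->[0]\nnumber of counts: $hit->[1]\n";
}
```

The count is taken on the sequence, not on the ID, and only non-overlapping matches are counted because of how `/g` advances through the string.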

Perl Script -: Useless use of array element in void context at letter_counter.pl line 38 and 44

This is my first Perl script:
http://bpaste.net/show/171137/
#!/usr/bin/perl
#This program will take a user's input and then count how many letters there are. Whereupon it will count the number of unique letters before printing all the data
#back to the user.
use strict;
use warnings;
#======================================================================================================================
# This section is to collect and spit back the input to the user.
#======================================================================================================================
print "\n\nHello, please enter a word, a phrase, or a sentence. Press Enter when you are done.\n";
my $input = <>; #Collecting the input from the user.
chomp $input; #Chomping, or removing, the \n from the end of the input.
print "\nYou typed -:[$input]\n";
#======================================================================================================================
#This section will find how many unique characters there are.
#======================================================================================================================
my @uniqueArray;
my @stringArray = split(// , $input);
my $x = 0;
my $string_max_index = $#stringArray;
for($stringArray[$x];$stringArray[$string_max_index];$x++)
{
my $found = 0;
my $test = $stringArray[$x];
my $y = 0;
for($uniqueArray[$y];$uniqueArray[$#uniqueArray];$y++)
{
if($test eq $uniqueArray[$y])
{
$found=1;
}
}
if($found eq 1)
{
$uniqueArray[$#uniqueArray] = $stringArray[$x];
}
}
#======================================================================================================================
# This section will determine how many ascii characters are in the $input variable and output the results of this
# program.
#======================================================================================================================
my $numOfLet = 0;
while ( $input ne "" )
{
$numOfLet = $numOfLet + 1;
chop $input
}
print "Total Characters -: $numOfLet";
print "Total of Unique Characters -: $#uniqueArray \n\n\n";
exit;
I was able to get rid of all the errors except for these two,
Useless use of array element in void context at letter_counter.pl line 38
Useless use of array element in void context at letter_counter.pl line 44
What is confusing me is that There is nothing at those lines, just the closing brackets for my for loop, which leads me to believe that the issue is an element I called in each for loop.
The initialization block of your for loop is the immediate culprit. Adjusting to something like this resolves the warning:
for(;$stringArray[$string_max_index];$x++)
Otherwise you're accessing a value, but doing... nothing with it? That's what the warning is for.
I spot a few other problems, though:
Your for loops are... a little funny, I don't know how else to put that.
Array length is usually easiest to read with the scalar function.
Adding members to an array is usually done with the push function.
Using the above in combination:
for(my $x = 0; $x < scalar @stringArray;$x++)
{
my $found = 0;
my $test = $stringArray[$x];
for (my $y = 0; !$found && $y < scalar @uniqueArray;$y++)
{
if($test eq $uniqueArray[$y])
{
$found=1;
}
}
unless ($found)
{
push @uniqueArray, $stringArray[$x];
}
}
If the above for loops don't look sensible to you, now is a good time to look up some tutorials.
This could be simplified with foreach loops:
foreach my $letter (@stringArray) {
...
}
Or with grep searches:
my $found = grep { $_ eq $letter } @uniqueArray;
But, in the particular case of counting unique values, it's often simplest to assign to a hash:
my %uniques;
$uniques{$_} = 1 for @stringArray;
my $num_uniques = scalar keys %uniques;
Combining all of that:
my @letters = split(//, $input); #split input into array of chars
my %uniques; #declare empty hash
$uniques{$_} = 1 for @letters; #set hash key for each char
my $num_letters = scalar @letters; #count entries in letter list
my $num_uniques = scalar keys %uniques; #count unique keys in letter hash
Exercise for the reader: adjust the above code so that it counts the number of times each character is used.
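One possible take on that exercise, as a minimal sketch: `count_chars` is a hypothetical helper name, and the input string is a fixed example rather than user input. Incrementing the hash value instead of setting it to 1 turns the set of unique characters into a frequency table.

```perl
use strict;
use warnings;

# count_chars is a hypothetical helper: it returns a hash reference mapping
# each character of $input to the number of times it occurs.
sub count_chars {
    my ($input) = @_;
    my %count;
    $count{$_}++ for split //, $input;
    return \%count;
}

my $input = 'hello';
my $count = count_chars($input);
for my $char (sort keys %$count) {
    print "$char: $count->{$char}\n";
}
print "Total characters: ", length($input), "\n";
print "Unique characters: ", scalar(keys %$count), "\n";
```

The number of unique characters is still the number of keys, so the earlier counts fall out of the same hash.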
That's because @uniqueArray is empty...
Given this short example:
use strict;
use warnings;
my @arr;
my $t = 0;
for ($arr[$t]; $arr[$#arr]; $t++ ) {
print "no\n";
}
__OUTPUT__
Useless use of array element in void context at t.pl line 11.
You declare my @uniqueArray; at line 21 and never do anything with it...
Which also means how will this ever match at line 34?
if($test eq $uniqueArray[$y])
Again, @uniqueArray is an empty array.
To fix your script (although please look at rutter's hash suggestion), you can do the following. Remove:
my $x = 0;
my $y = 0;
Instead of using C-style loops, replace with the following:
for my $x (0 .. $string_max_index )
for my $y (0 .. $#uniqueArray)
Lastly, use the following:
if(!$found)
{
push @uniqueArray, $stringArray[$x];
}
Hope this helps!

How to search through array elements for match in hash keys

I've an array that contains unique IDs (numeric) for DNA sequences. I've put my DNA sequences in a hash so that each key contains a descriptive header, and its value is the DNA sequence. Each header in this list contains gene information and is suffixed with its unique ID number:
Unique ID: 14272
Header(hash key): PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272
Sequence (hash value): ATGGGTC...
I want to cycle through each Unique ID and see if it matches the number at the end of each header(hash key) and, if so, print the hash key + value into a file. So far I've got this:
my %hash;
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
Whereby the unique IDs are contained within @scaffoldnames.
This doesn't work! I'm unsure as to how best to loop through both the hash and the array to find a match.
I'll expand below:
Upstream code:
foreach(@scaffoldnames) {
s/[^0-9]*//g;
} #Remove all non-numerics
my @genes = read_file('splice.txt'); #Splice.txt is a fasta file
my $hash_index = '';
my $hash_seq = '';
foreach(@genes){
if (/^>/){
my $head = $_;
$hash_index .= $head; #Collect all heads for hash
}
else {
my $sequence = $_;
$hash_seq .= $sequence; #Collect all sequences for hash
}
}
my @hash_index = split(/\n/,$hash_index); #element[0]=head1, element[1]=head2
my @hash_seq = split(/\n/, $hash_seq); #element[0]=seq1, element[1]=seq2
my %hash; # Make hash from both arrays - heads as keys, seqs as values
@hash{@hash_index} = @hash_seq;
foreach $hash_index (sort keys %hash) {
for ($i=0; $i <= $#scaffoldnames; $i++) {
if ($hash_index =~ /$scaffoldnames[$i]$/) {
print GENE_ID "$hash_index\n$hash{$hash_index}\n";
}
}
}
close(GENE_ID);
I'm trying to isolate all differently expressed genes (by unique ID) as outputted by cuffdiff (RNA-Seq) and relate them to the scaffolds (in this case expressed sequences) from which they came.
I'm hoping therefore that I can take isolate each unique ID and search through the original fasta file to pull out the header it matches and the sequence it's associated with.
You seem to have missed the point of hashes: they are used to index your data by keys so that you can access the relevant information in one step, like you can with arrays. Looping over every hash element kinda spoils the point. For instance, you wouldn't write
my $value;
for my $i (0 .. $#data) {
$value = $data[$i] if $i == 5;
}
you would simply do this
my $value = $data[5];
It is hard to help properly without some more information about where your information has come from and exactly what it is you want, but this code should help.
I have used one-element arrays that I think look like what you are using, and built a hash that indexes both the header and the sequence as a two-element array, using the ID (the trailing digits of the header) as a key. Then you can just look up the information for, say, ID 14272 using $hash{14272}. The header is $hash{14272}[0] and the sequence is $hash{14272}[1].
If you provide more of an indication about your circumstances then we can help you further.
use strict;
use warnings;
my @hash_index = ('PREDICTEDXenopusSiluranatropicalishypotheticalproteinLOCLOCmRNA14272');
my @hash_seq = ('ATGGGTC...');
my @scaffoldnames = (14272);
my %hash = map {
my ($key) = $hash_index[$_] =~ /(\d+)\z/;
$key => [ $hash_index[$_], $hash_seq[$_] ];
} 0 .. $#hash_index;
open my $gene_fh, '>', 'gene_id.txt' or die $!;
for my $name (@scaffoldnames) {
next unless my $info = $hash{$name};
printf $gene_fh "%s\n%s\n", @$info;
}
close $gene_fh;
Update
From the new code you have posted it looks like you can replace that section with this code.
It works by taking the trailing digits from every sequence header that it finds, and using that as a key to choose a hash element to append the data to. The hash values are the header and the sequence, all in a single string. If you have a reason for keeping them separate then please let me know.
foreach (@scaffoldnames) {
s/\D+//g;
} # Remove all non-numerics
open my $splice_fh, '<', 'splice.txt' or die $!; # splice.txt is a FASTA file
my %sequences;
my $id;
while (<$splice_fh>) {
($id) = /(\d+)$/ if /^>/;
$sequences{$id} .= $_ if $id;
}
for my $id (@scaffoldnames) {
if (my $sequence = $sequences{$id}) {
print GENE_ID $sequence;
}
}

Swap key and array value pair

I have a text file laid out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have Python (if you're on Linux you probably already have it), I'd use a short Python script to do this. Note that we use sets to filter out duplicate items.
Edited to be closer to requester's requirements:
import csv
from decimal import Decimal

maindict = {}
with open('test.csv', newline='') as infile:
    for row in csv.reader(infile, delimiter='\t'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        value = Decimal(row[0])
        # The second field holds the comma-separated values, e.g. "a, b, c"
        for key in (k.strip() for k in row[1].split(',')):
            # A set drops duplicates such as the repeated "c" on the "2" line
            maindict.setdefault(key, set()).add(value)

with open('out.csv', 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    # Sort the keys case-insensitively and the numbers numerically
    for key in sorted(maindict, key=str.lower):
        csv_writer.writerow([key, ', '.join(str(v) for v in sorted(maindict[key]))])
I would try perl if that's available to you. Loop through the input a row at a time. Split the line on the tab then the right hand part on the commas. Shove the values into an associative array with letters as the keys and the value being another associative array. The second associative array will be playing the part of a set so as to eliminate duplicates.
Once you read the input file, sort based on the keys of the associative array, loop through and spit out the results.
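The steps described above can be sketched as follows. This is a minimal illustration with the sample lines hard-coded: `invert_lines` is a hypothetical helper name, and in a real script the lines would come from the input file. The inner hash plays the part of the set, so a repeated value such as the second c on the "2" line is stored only once.

```perl
use strict;
use warnings;

# invert_lines is a hypothetical helper: it takes lines of the form
# "number<TAB>a, b, c" and returns a hash of hashes, letter => { number => 1 },
# where the inner hash acts as a set.
sub invert_lines {
    my (@lines) = @_;
    my %by_letter;
    for my $line (@lines) {
        chomp $line;
        # Split on the tab first, then split the right-hand part on commas.
        my ($number, $list) = split /\t/, $line, 2;
        next unless defined $list;
        $by_letter{$_}{$number} = 1 for grep { length } split /\s*,\s*/, $list;
    }
    return \%by_letter;
}

my $inverted = invert_lines("1\ta, b, c\n", "2\tc, b, c\n", "2.5\ta, c\n");
for my $letter (sort keys %$inverted) {
    # Sort the numbers numerically so that 2 comes before 2.5.
    my @numbers = sort { $a <=> $b } keys %{ $inverted->{$letter} };
    print "$letter\t", join(', ', @numbers), "\n";
}
```

Using a hash of hashes keeps duplicate removal and lookup in one structure, at the cost of sorting the numbers again at output time.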
Here's a small utility in PHP:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
list($num, $values) = explode("\t", $line);
$newData["$num"] = explode(", ", trim($values));
}
unset($data);
// reverse the index/value association
foreach ($newData as $index => $values) {
asort($values);
foreach($values as $value) {
if (!isset($data[$value]))
$data[$value] = array();
if (!in_array($index, $data[$value]))
array_push($data[$value], $index);
}
}
// printout the result
foreach ($data as $index => $values) {
echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';
our %data;
while(<>){
chomp;
my($number,$csv) = split /\t/;
my @csv = split m"\s*,\s*", $csv;
push @{$data{$_}}, $number for @csv;
}
for my $number (sort keys %data){
my @unique = sort keys %{{ map { ($_,undef) } @{$data{$number}} }};
say $number, "\t", join ', ', #unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });
open my $file, "<", "file/to/read.txt";
while(<$file>) {
my ($first, $rest) = split /\t/, $_, 2;
my @values;
if($csv->parse($rest)) {
@values = $csv->fields()
} else {
warn "Error: invalid CSV: $rest";
next;
}
foreach(@values) {
push @{ $hash{$_} }, $first;
}
}
# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach(sort { $a cmp $b } keys %hash) {
print "$_\t", join(",", sort { $a <=> $b } @{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
if (key != $1) {
if (key) printf "\n";
key=$1;
printf "%s\t%s", $1, $2
} else {
printf ", %s", $2
}
}
END {printf "\n"}'
Pipe them all together and you should get your desired output. I tested with the GNU tools.
