Checking for duplicates in an array

What's going on:
I've ssh'd onto my localhost, run ls on the Desktop, and put the resulting items into an array.
I hardcoded a short list of items and I am comparing the two lists with a hash to see if anything is missing from the host (i.e. see if something from list a is NOT in list b, and let me know).
After figuring that out, when I print out the "missing files" I get a bunch of duplicates (see below). I'm not sure if that has to do with how the files are being checked in the loop, but I figured the best thing to do would be to just sort the data and eliminate the dupes.
When I do that and print out the fixed data, only one file is printed; two are missing.
Any idea why?
#!/usr/bin/perl
my $hostname = $ARGV[0];
my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output = `ssh $hostname "cd Desktop; ls -a"`;
my %comparison;
for my $file (@hostFiles) {
    $comparison{$file} += 1;
}
for my $file (@output) {
    $comparison{$file} += 2;
}
for my $file (sort keys %comparison) {
    @missing = "$file\n" if $comparison{$file} == 1;
    #print "Extra file: $file\n" if $comparison{$file} == 2;
    print @missing;
}
my @checkedMissingFiles;
foreach my $var ( @missing ) {
    if ( ! grep( /$var/, @checkedMissingFiles ) ) {
        push( @checkedMissingFiles, $var );
    }
}
print "\n\nThe missing Files without dups:\n @checkedMissingFiles\n";
Password:
awesomeness.txt ##This is what is printing after comparing the two arrays
awesomeness.txt
filecheck.pl
filecheck.pl
filecheck.pl
hostscript.pl
hostscript.pl
The missing Files without dups: ##what prints after weeding out duplicates
hostscript.pl

The Perl way of doing this would be:
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my %hostFiles = qw( filecheck.pl 1 hostscript.pl 1 awesomeness.txt 1 );

# ssh + backticks + ls, not the greatest way to do this, but that's another Q
my @files = `ssh $ARGV[0] "ls -a ~/Desktop"`;

# get rid of the newlines
chomp @files;

# grep returns the matching elements of @files
my %existing = map { $_ => 1 } grep { exists($hostFiles{$_}) } @files;

print Dumper([grep { !exists($existing{$_}) } keys %hostFiles]);
Data::Dumper is a utility module; I use it for debugging or demonstration purposes.
If you want to print the list you can do something like this:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    print grep { !exists($existing{$_}) } keys %hostFiles;
}
$ORS is the output record separator (it's printed after any print) and $OFS is the output field separator which is printed between the print arguments. See perlvar. You can get away with not using "English", but the variable names will look uglier. The block and the local are so you don't have to save and restore the values of the special variables.
If you want to write the result to a file, something like this would do:
{
    use English;
    local $OFS = "\n";
    local $ORS = "\n";
    open F, ">host_$ARGV[0].log";
    print F grep { !exists($existing{$_}) } keys %hostFiles;
    close F;
}
Of course, you can also do it the "classical" way, looping through the array and printing each element:
open F, ">host_$ARGV[0].log";
for my $missing_file (grep { !exists($existing{$_}) } keys %hostFiles) {
    use English;
    local $ORS = "\n";
    print F "File is missing: $missing_file";
}
close F;
This allows you to do more things with the file name, for example, you can SCP it over to the host.
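For example, something along these lines would copy the log over (a sketch; it assumes ssh keys are set up so scp does not prompt for a password, and that copying to the remote home directory is what you want):

system('scp', "host_$ARGV[0].log", "$ARGV[0]:") == 0
    or warn "scp failed: $?";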

It seems to me that looping over the 'required' list makes more sense - looping over the list of existing files isn't necessary unless you're looking for files that exist but aren't needed.
#!/usr/bin/perl
use strict;
use warnings;

my @hostFiles = ("filecheck.pl", "hostscript.pl", "awesomeness.txt");
my @output = `ssh $ARGV[0] "cd Desktop; ls -a"`;
chomp @output;

my @missingFiles;
foreach (@hostFiles) {
    push( @missingFiles, $_ ) unless $_ ~~ @output;
}
print join("\n", "Missing files: ", @missingFiles);
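Note that the smartmatch operator ~~ has been marked experimental since Perl 5.18 and warns accordingly; if that is a concern, the same membership test can be written with a plain grep (a sketch using the same variables as above):

foreach my $file (@hostFiles) {
    push( @missingFiles, $file ) unless grep { $_ eq $file } @output;
}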

@missing = "$file\n" assigns the array @missing to contain a single element, "$file\n". It does this on every pass of the loop, leaving it with only the last missing file.
What you want is push(@missing, "$file\n").
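A minimal corrected sketch of that loop, keeping the question's variable names (the print moves outside the loop so each missing file is reported once):

my @missing;
for my $file (sort keys %comparison) {
    push @missing, "$file\n" if $comparison{$file} == 1;
}
print @missing;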

Related

Empty array in a perl while loop, should have input

I was working on this script when I came across a weird anomaly. When I go to print @extract after declaring it, it correctly prints the following:
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Now the weird part: when I then try to print or return @extract (or $column) inside the while loop, it comes up empty, thus rendering the rest of the script useless. I've never come across this before and haven't been able to find any documentation or people with similar problems. Below is the code; I marked the spots that do and don't work with #<------ so that someone might see what is going on. Thank you kindly.
P.S. I am using Perl version 5.12.2.
use strict;
use warnings;
#use diagnostics;
#use feature qw(say);

open (S, "Val nuc align.txt") || die "cannot open FASTA file to read: $!";
open (OUTPUT, ">output.txt");

my @extract;
my $sum = 0;
my @lines = <S>;
my @seq = ();
my $start = 0;  #amino acid column start
my $end = 10;   #amino acid column end

#Removing of the sequence tag until amino acid sequence composition (from >gi to )).
foreach my $line (@lines) {
    $line =~ s/\n//g;
    if ($line =~ />/g) {
        $line =~ s/>.*\]/>/g;
        push @seq, $line;
    }
    else {
        push @seq, $line;
    }
}
my $seq = join ('', @seq);
my @seq_prot = join "\n", split '>', $seq;
@seq_prot = grep {/[A-Z]/} @seq_prot;

#number of sequences
print OUTPUT "Number of sequences:", scalar (grep {defined} @seq_prot), "\n";

#selection of amino acid sequence. From $start to $end.
my @vertical_array;
while ( my $line = <@seq_prot> ) {
    chomp $line;
    my @split_line = split //, $line;
    for my $index ( $start..$end ) { #AA position, extracts whole columns
        $vertical_array[$index] .= $split_line[$index];
    }
}

# Print out your vertical lines
for my $line ( @vertical_array ) {
    my $extract = say OUTPUT for unpack "(a200)*", $line; #split at end of each column
    @extract = grep {defined} $extract;
}

print OUTPUT @extract; #<--------------- This prints correctly the input

#Count selected amino acids excluding '-'.
my %counter;
while (my $column = @extract) {
    print @extract; #<------------------------ Empty print, no input found
}
Update: I found the main problem to be with the unpack command; I thought I could utilize it to split the columns of my input at X elements (43 in this case). While this works, the minute I change $start to a number other than 0 (say 200), the code brings up errors. It probably has something to do with the number of column elements not matching the lines. Will keep this updated.
Write your last while loop the same way as your previous for loop. The assignment
my $column = @extract
is in scalar context, which does not give you the same result as:
for my $column (@extract)
Instead, it will give you the number of elements in the array. Try this second option and it should work.
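For instance, the counting loop hinted at by the comment in your code ("Count selected amino acids excluding '-'") might be sketched as follows; the per-character split is an assumption about what you intend to count:

my %counter;
for my $column (@extract) {
    for my $aa (split //, $column) {
        $counter{$aa}++ unless $aa eq '-';
    }
}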
However, I still have a concern: if @extract had anything in it, you would get an infinite loop. Is there any code that you did not include between your two commented lines?

grepping command line arguments out of an array in perl

I have a file that looks like this:
[options42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.3.4
logged-check-reply=192.168.2.5
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=19:00:00
falldown=logwrite after 10000
failtolog=logwrite after 10000
listento=all
global-search-text=Target Down. This message is stored;
[stock42BuySide]
logged-check-times=06:01:00
logged-check-address=192.168.2.13
logged-check-reply=192.168.2.54
logged-check-vac-days=sat,sun
start-time=06:01:00
stop-time=18:00:00
The script grinds the list down to just the name, start and stop time.
sellSide40, start-time=07:05:00, stop-time=17:59:00
SellSide42, start-time=07:06:00, stop-time=17:29:00
SellSide44, start-time=07:31:00, stop-time=16:55:00
42SellSide, start-time=09:01:00, stop-time=16:59:00
The problem is that I would like to filter out specific names from the file with command line parameters.
I am trying to use the @ARGV array and grep the command line values out of the @nametimes array. Something like:
capser@capser$ ./get_start_stop SorosSellSide42 ETFBuySide42
The script works fine for parsing the file - I just need help on the command line array
#!/usr/bin/perl
use strict ;
use warnings ;

my ($name , $start, $stop, $specific);
my @nametimes;
my $inifile = "/var/log/Alert.ini";

open ( my $FILE, '<', "$inifile") or die ("could not open the file -- $!");
while (<$FILE>) {
    chomp ;
    if (/\[(\w+)\]/) {
        $name = $1;
    } elsif (/(start-time=\d+:\d+:\d+)/) {
        $start = $1;
    } elsif (/(stop-time=\d+:\d+:\d+)/) {
        $stop = $1;
        push (@nametimes, "$name, $start, $stop");
    }
}

for ($a = 0; $a >= $#ARGV ; $a++) {
    $specific = (grep /$ARGV[$a]/, @nametimes) ;
    print "$specific\n";
}
It is probably pretty easy - however I have worked on it for days, and I am the only guy that uses Perl in this shop. I don't have anyone to ask, and Googling is not panning out. I apologize in advance for angering the Perl deities, who are sure to yell at me for asking such an easy question.
Your construct for looping over @ARGV is a bit unwieldy - the more common way of doing that would be:
for my $name (@ARGV) {
    # do something
}
But really, you don't even need to loop over it. You can just join them all directly into a single regular expression:
my $names = join("|", @ARGV);
my @matches = grep { /\b($names)\b/ } @nametimes;
I've used \b in the regex here - that indicates a word boundary, so the argument SellSide4 wouldn't match SellSide42. That may or may not be what you want...
Use an array to store the results from the grep(), not a scalar. Push them, don't assign. Otherwise the second iteration of the for loop will overwrite the results. Something like:
my @specific;
for my $el ( @ARGV ) {
    push @specific, grep { /$el/ } @nametimes;
}
print join "\n", @specific;
The easiest thing to do is to store your INI file as a structure. Then you can go through your structure and pull out what you want. The simplest structure would be a hash of hashes, where the heading is the key of the outer hash and the inner hash is keyed by parameter.
Here is the code that creates the basic structure:
use warnings;
use strict;
use autodie;
use feature qw(say);
use Data::Dumper;

use constant INI_FILE => "test.file.txt";

open my $ini_fh, "<", INI_FILE;

my %ini_file;
my $heading;
while ( my $line = <$ini_fh> ) {
    chomp $line;
    if ( $line =~ /\[(.*)\]/ ) {   # Heading name
        $heading = $1;
    }
    elsif ( $line =~ /(.+?)\s*=\s*(.+)/ ) {
        my $parameter = $1;
        my $value = $2;
        $ini_file{$heading}->{$parameter} = $value;
    }
    else {
        say "Invalid line $. - $line";
    }
}
After this, the structure will look like this:
$VAR1 = {
    'options42BuySide' => {
        'stop-time' => '19:00:00',
        'listento' => 'all',
        'logged-check-reply' => '192.168.2.5',
        'logged-check-vac-days' => 'sat,sun',
        'falldown' => 'logwrite after 10000',
        'start-time' => '06:01:00',
        'logged-check-address' => '192.168.3.4',
        'logged-check-times' => '06:01:00',
        'failtolog' => 'logwrite after 10000',
        'global-search-text' => 'Target Down. This message is stored;'
    },
    'stock42BuySide' => {
        'stop-time' => '18:00:00',
        'start-time' => '06:01:00',
        'logged-check-reply' => '192.168.2.54',
        'logged-check-address' => '192.168.2.13',
        'logged-check-vac-days' => 'sat,sun',
        'logged-check-times' => '06:01:00'
    }
};
Now, all you have to do is parse your structure and pull the information you want out of it:
for my $heading ( sort keys %ini_file ) {
    say "$heading " . $ini_file{$heading}->{"start-time"} . " " . $ini_file{$heading}->{"stop-time"};
}
You could easily modify this last loop to skip the headings you want, or to print out the exact parameters you want.
I would also recommend using Getopt::Long to parse your command line parameters:
my_file -include SorosSellSide42 -include ETFBuySide42 -param start-time -param stop-time
Getopt::Long can store your parameters in arrays. For example, it would put all the -include parameters in an @includes array and all the -param parameters in an @parameters array:
for my $heading ( @includes ) {
    print "$heading ";
    for my $parameter ( @parameters ) {
        print $ini_file{$heading}->{$parameter} . " ";
    }
    print "\n";
}
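The options themselves could be collected along these lines (a sketch; the flag names follow the example invocation above, and repeated flags accumulate into the arrays):

use Getopt::Long;

my (@includes, @parameters);
GetOptions(
    'include=s' => \@includes,      # -include may be given more than once
    'param=s'   => \@parameters,    # -param may be given more than once
) or die "Invalid options\n";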
Of course, there needs to be lots of error checking (does the heading exist? What about the requested parameters?). But, this is the basic structure. Unless your file is extremely long, this is probably the easiest way to process it. If your file is extremely long, you could use the @includes and @parameters in the first loop as you read in the parameters and headings.

Mismatch in array comparison: error message 'Argument "" isn't numeric in array element at'

I am very new to Perl and am struggling to get this script to work.
I have taken pieces of Perl and gotten them to work as individual sections, but when I try to blend them together it fails. Even with the error messages that show up, I cannot find where my mistake is.
The script, when working and completed, will read an output file, go through it section by section, and ultimately generate a new output file with not much more than a heading, some additional text, and the number of lines in that section.
My issue is that when it loops over each keyword in the array, it now fails with the error message 'Argument "" isn't numeric in array element at'. Perl directs me to a section in the script, but I cannot see how I am calling the element incorrectly. All the elements in the array are alphabetic, yet the error message refers to a numeric value.
Can anyone see my mistake?
Thank you.
Here is the script:
#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;

# this version reads each variable and loops through the 18 times put only displays on per loop.
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
my $d = "_deco_mini.txt";
chomp $b;
my $STRING = "$a$b$c";
my $STRING_out = "$a$b$d";

my @keyword = ( "Report", "Last", "HP", "sulog", "sudo", "eTrust", "proftp", "process", "active clusters", "pdos", "syslog", "BNY", "syslogmon", "errpt", "ports", "crontab", "NFS", "scripts", "messages");
my $i = 0;
my $keyword="";
my $x=0;
my $y=0;
my $jw="";
my $EOS = "########################################################################";
my $qty_lines=0;
my $skip5=0;
my $skipcnt=0;
my $keeplines=0;
my @HPLOG="";

do {
    print "Reading File: [$STRING]\n";
    if (-e "$STRING" && open (IN, "$STRING")) {
        # ++$x;                              # proving my loop worked
        # print "$x interal loop counter\n"; # proving my loop worked
        for ( ++$i) { # working
            while ( <IN> ) {
                chomp ;
                #if ($_ =~ /$keyword/) {
                #if ($_ =~ / $i /) {
                #if ($_ =~ /$keyword[ $i ]/) {
                if ($_ =~ /$keyword $i/) {
                    print " $i \n";
                    $skip5=1;
                    next;
                    # print "$_\n";# $ not initalized error when tring to use it
                }
                if ($skip5) {
                    $skipcnt++;
                    print "SKIP LINE: $_\n";
                    print "Header LINE: $_\n";
                    next if $skipcnt <= 5;
                    $skip5=0;
                    $keeplines=1;
                }
                if ($keeplines) {
                    # ++$qty_lines;                # for final output
                    last if $_ =~ /$EOS/;
                    print "KEEP LINE: $_\n";
                    # print "$qty_lines\n";        # for final output
                    push @HPLOG, "$_\n";
                    # push @HPLOG, "$qty_lines\n"; # for final output
                }
            } ## end while ( <IN> )
        } ## end for ( ++$i)
    } ## end if (-e "$STRING" && open (IN, "$STRING"))
    close (IN);
} while ( $i < 19 && ++$y < 18 );
Here is a sample section of the input file.
###############################################################################
Checking for active clusters.
#########
root 11730980 12189848 0 11:24:20 pts/2 0:00 egrep hagsd|harnad|HACMP|haemd
If there are any processes listed you need to remove the server from the cluster.
############################################################################
This is the output from Pdos log
Please review it for anything that looks like a users may be trying to run something.
#########
This server is not on Tamos
############################################################################
This is the output from syslog.conf.
Look for any entries on the right side column that are not the ususal logs or location.
#########
# @(#)34 1.11 src/bos/etc/syslog/syslog.conf, cmdnet, bos610 4/27/04 14:47:53
# IBM_PROLOG_BEGIN_TAG
# This is an automatically generated prolog.
#
# bos610 src/bos/etc/syslog/syslog.conf 1.11
I truncated the rest of the file
Can anyone see my mistake?
I can see quite a lot of mistakes. But I also see some good stuff like use strict and use warnings.
My suggestion for you is to work on your coding style so that it gets easier for you and others to debug any problems.
Naming variables
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
my $d = "_deco_mini.txt";
chomp $b;
my $STRING = "$a$b$c";
my $STRING_out = "$a$b$d";
Why are some of those names all uppercase and others all lower case? If you are building up a filename, why do you call the variable that holds the filename $STRING?
my @keyword = ( "Report", "Last", "HP", "sulog", "sudo", ....
If you have a list of several keywords, wouldn't it be apt not to choose a singular for the variable name? How about @keywords?
Using temporary variables you don't need
my $NODE = `uname -n`;
my $a = "/tmp/";
my $b = $NODE ;
my $c = "_deco.txt";
chomp $b;
my $STRING = "$a$b$c";
Why do you need $a, $b and $c? The (forgive me) stupid names of those vars are a tell-tale sign that you don't need them. How about this instead?
my $node_name = `uname -n`;
chomp $node_name;
my $file_name = sprintf '/tmp/%s_deco.txt', $node_name;
Your biggest problem: you have no idea how to use arrays
You are making several drastic mistakes when it comes to arrays.
my #HPLOG="";
Do you want an array or another string? The # says array, the "" says string. I guess you wanted a new, empty array, so my #hplog = () would have been much better. But since there is no need to tell perl that you want an empty array as it will give you an empty one anyway, my #hplog; will do the job just fine.
It took me a while to figure out this next one and I'm still not sure whether I'm guessing your intentions correctly:
my @keyword = ( "Report", "Last", "HP", "sulog", "sudo", "eTrust", "proftp", "process", "active clusters", "pdos", "syslog", "BNY", "syslogmon", "errpt", "ports", "crontab", "NFS", "scripts", "messages");
...
if ($_ =~ /$keyword $i/) {
What I think you are doing here is trying to match your current input line against element number $i in @keyword. If my assumption is correct, you really wanted to say this:
if ( /$keyword[ $i ]/ ) {
Iterating arrays
Perl is not C. It doesn't make you jump through hoops to get a loop.
Just look at all the code you wrote to loop through your keywords:
my $i = 0;
...
for ( ++$i) { # working
...
if ($_ =~ /$keyword $i/) {
...
} while ( $i < 19 && ++$y < 18 );
Apart from the facts that your working comment is just self-deception and that you hard-coded the number of elements in your array, you could have just used a for-each loop:
foreach my $keyword ( @keywords ) {
    # more code here
}
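For instance, reading the file once and then looping over the keywords might look something like this (a sketch built from the suggestions above; $file_name and @keywords are the renamed variables, and the per-keyword handling is left open):

open my $in, '<', $file_name or die "cannot open $file_name: $!";
my @lines = <$in>;
close $in;
chomp @lines;

foreach my $keyword (@keywords) {
    my @matches = grep { /\Q$keyword\E/ } @lines;
    print scalar(@matches), " line(s) match '$keyword'\n";
}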
I'm sure that when you try to work on the above list, the problem that made you ask here will just go away. Have fun.

How can I extract just the elements I want from a Perl array?

Hey I'm wondering how I can get this code to work. Basically I want to keep the lines of $filename as long as they contain the $user in the path:
open STDERR, ">/dev/null";
$filename = `find -H /home | grep $file`;
@filenames = split(/\n/, $filename);
for $i (@filenames) {
    if ($i =~ m/$user/) {
        # keep results
    } else {
        delete $i; # does not work.
    }
}
$filename = join ("\n", @filenames);
close STDERR;
I know you can delete like delete $array[index] but I don't have an index with this kind of loop that I know of.
You could replace your loop with:
@filenames = grep /$user/, @filenames;
There's no way to do it when you're using a foreach loop. But never mind: the right thing to do is to use File::Find to accomplish your task.
use File::Find 'find';
...
my @files;
my $wanted = sub {
    return unless /\Q$file/ && /\Q$user/;
    push @files, $_;
};
find({ wanted => $wanted, no_chdir => 1 }, '/home');
Don't forget to escape your variables with \Q for use in regular expressions.
BTW, redirecting your STDERR to /dev/null is better written as
{
    local *STDERR;
    open STDERR, '>', '/dev/null';
    ...
}
It restores the filehandle after exiting the block.
If you have a find that supports -path, then make it do the work for you, e.g.,
#! /usr/bin/perl
use warnings;
use strict;

my $user = "gbacon";
my $file = "bash";
my $path = join " -a " => map "-path '*$_*'", $user, $file;
chomp(my @filenames = `find -H /home $path 2>/dev/null`);
print map "$_\n", @filenames;
Note how backticks in list context give back a list of lines (including their terminators, removed above with chomp) from the command's output. This saves you having to split them yourself.
Output:
/home/gbacon/.bash_history
/home/gbacon/.bashrc
/home/gbacon/.bash_logout
If you want to remove an item from an array, use the multi-talented splice function.
my @foo = qw( a b c d e f );
splice( @foo, 3, 1 ); # Remove element at index 3.
You can do all sorts of other manipulations with splice. See the perldoc for more info.
As codeholic alludes to, you should never modify an array while iterating over it with a for loop. If you want to modify an array while iterating, use a while loop instead.
The reason for this is that for evaluates the expression in parens once, and maps each item in the result list to an alias. If the array changes, the pointers get screwed up and chaos will follow.
A while evaluates the condition each time through the loop, so you won't run into issues with pointers to non-existent values.
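A sketch of that pattern, walking an index and removing elements in place with splice (the sample data here is hypothetical, just to make the snippet self-contained):

my @filenames = ('/home/alice/notes.txt', '/home/bob/notes.txt');
my $user      = 'alice';

my $i = 0;
while ($i <= $#filenames) {
    if ($filenames[$i] =~ /\Q$user/) {
        $i++;                         # keep it, advance
    } else {
        splice(@filenames, $i, 1);    # remove it, do not advance
    }
}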

Swap key and array value pair

I have a text file laid out like this:
1 a, b, c
2 c, b, c
2.5 a, c
I would like to reverse the keys (the number) and values (CSV) (they are separated by a tab character) to produce this:
a 1, 2.5
b 1, 2
c 1, 2, 2.5
(Notice how 2 isn't duplicated for c.)
I do not need this exact output. The numbers in the input are ordered, while the values are not. The output's keys must be ordered, as well as the values.
How can I do this? I have access to standard shell utilities (awk, sed, grep...) and GCC. I can probably grab a compiler/interpreter for other languages if needed.
If you have Python (if you're on Linux you probably already have it), I'd use a short Python script to do this. Note that we use sets to filter out "double" items.
Edited to be closer to requester's requirements:
import csv
from decimal import *

getcontext().prec = 7

csv_reader = csv.reader(open('test.csv'), delimiter='\t')
maindict = {}
for row in csv_reader:
    value = row[0]
    for key in row[1:]:
        try:
            maindict[key].add(Decimal(value))
        except KeyError:
            maindict[key] = set()
            maindict[key].add(Decimal(value))

csv_writer = csv.writer(open('out.csv', 'w'), delimiter='\t')
sorted_keys = [x[1] for x in sorted([(x.lower(), x) for x in maindict.keys()])]
for key in sorted_keys:
    csv_writer.writerow([key] + sorted(maindict[key]))
I would try Perl if that's available to you. Loop through the input a row at a time. Split the line on the tab, then split the right-hand part on the commas. Shove the values into an associative array with the letters as keys and another associative array as each value; the second associative array plays the part of a set, so duplicates are eliminated.
Once you have read the input file, sort the keys of the associative array, loop through them, and spit out the results.
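A minimal sketch of that idea (hash of hashes used as a set, reading tab-separated lines from STDIN or the files named on the command line; the numeric sort of the values is an assumption based on the desired output):

use strict;
use warnings;

my %seen;
while (<>) {
    chomp;
    my ($num, $csv) = split /\t/;
    $seen{$_}{$num} = 1 for split /\s*,\s*/, $csv;   # inner hash acts as a set
}
for my $letter (sort keys %seen) {
    print "$letter\t", join(', ', sort { $a <=> $b } keys %{ $seen{$letter} }), "\n";
}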
Here's a small utility in PHP:
// load and parse the input file
$data = file("path/to/file/");
foreach ($data as $line) {
    list($num, $values) = explode("\t", $line);
    $newData["$num"] = explode(", ", trim($values));
}
unset($data);

// reverse the index/value association
foreach ($newData as $index => $values) {
    asort($values);
    foreach ($values as $value) {
        if (!isset($data[$value]))
            $data[$value] = array();
        if (!in_array($index, $data[$value]))
            array_push($data[$value], $index);
    }
}

// printout the result
foreach ($data as $index => $values) {
    echo "$index\t" . implode(", ", $values) . "\n";
}
not really optimized or good looking, but it works...
# use Modern::Perl;
use strict;
use warnings;
use feature qw'say';

our %data;

while (<>) {
    chomp;
    my ($number, $csv) = split /\t/;
    my @csv = split m"\s*,\s*", $csv;
    push @{$data{$_}}, $number for @csv;
}

for my $number (sort keys %data) {
    my @unique = sort keys %{{ map { ($_, undef) } @{$data{$number}} }};
    say $number, "\t", join ', ', @unique;
}
Here is an example using CPAN's Text::CSV module rather than manual parsing of CSV fields:
use strict;
use warnings;
use Text::CSV;

my %hash;
my $csv = Text::CSV->new({ allow_whitespace => 1 });

open my $file, "<", "file/to/read.txt";
while (<$file>) {
    my ($first, $rest) = split /\t/, $_, 2;
    my @values;
    if ($csv->parse($rest)) {
        @values = $csv->fields();
    } else {
        warn "Error: invalid CSV: $rest";
        next;
    }
    foreach (@values) {
        push @{ $hash{$_} }, $first;
    }
}

# this can be shortened, but I don't remember whether sort()
# defaults to <=> or cmp, so I was explicit
foreach (sort { $a cmp $b } keys %hash) {
    print "$_\t", join(",", sort { $a <=> $b } @{ $hash{$_} }), "\n";
}
Note that it will print to standard output. I recommend just redirecting standard output, and if you expand this program at all, make sure to use warn() to print any errors, rather than just print()ing them. Also, it won't check for duplicate entries, but I don't want to make my code look like Brad Gilbert's, which looks a bit wack even to a Perlite.
Here's an awk(1) and sort(1) answer:
Your data is basically a many-to-many data set so the first step is to normalise the data with one key and value per line. We'll also swap the keys and values to indicate the new primary field, but this isn't strictly necessary as the parts lower down do not depend on order. We use a tab or [spaces],[spaces] as the field separator so we split on the tab between the key and values, and between the values. This will leave spaces embedded in the values, but trim them from before and after:
awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }'
Then we want to apply your sort order and eliminate duplicates. We use a bash feature to specify a tab char as the separator (-t $'\t'). If you are using Bourne/POSIX shell, you will need to use '[tab]', where [tab] is a literal tab:
sort -t $'\t' -u -k 1f,1 -k 2n
Then, put it back in the form you want:
awk -F '\t' '{
    if (key != $1) {
        if (key) printf "\n";
        key=$1;
        printf "%s\t%s", $1, $2
    } else {
        printf ", %s", $2
    }
}
END {printf "\n"}'
Pipe them all together and you should get your desired output. I tested with the GNU tools.
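Putting the three stages above together in one pipeline (assuming the input lives in a file called input.txt, a name used here only for illustration):

awk -F '\t| *, *' '{ for (i=2; i<=NF; ++i) { print $i"\t"$1 } }' input.txt |
  sort -t $'\t' -u -k 1f,1 -k 2n |
  awk -F '\t' '{ if (key != $1) { if (key) printf "\n"; key=$1; printf "%s\t%s", $1, $2 } else { printf ", %s", $2 } } END { printf "\n" }'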