Merging two files based on columns and sorting

I have two files, FILE1 and FILE2, that have a different number of
columns and some columns in common. In both files the first column is
a row identifier. I want to merge the two files (FILE1 and FILE2)
without changing the order of the columns, inserting the value '5' wherever
a value is missing.
For example FILE1 (first column is the row ID, A1 is the first row, A2
the second, ...):
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
The column names for FILE1 are (these are specified in another file):
Affy1
Affy3
Affy4
Affy5
which is to say that the value in row A1, column Affy1 is 1
and the value in row A3, column Affy5 is 2
v~~~~~ Affy3
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
^~~~ Affy1
Similarly for FILE2
B1 1 2 0
B2 0 1 1
B3 5 1 1
And its column names,
Affy1
Affy2
Affy3
Meaning that
v~~~~~ Affy2
B1 1 2 0
B2 0 1 1
B3 5 1 1
^~~~ Affy1
I want to merge and sort the columns based on the column names and put a
'5' for missing values, so the merged result would be as follows:
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
And the columns:
Affy1
Affy2
Affy3
Affy4
Affy5
Which is to say,
v~~~~~~~ Affy2
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
^~~~ Affy1
In reality I have over 700K columns and over 2K rows in each file. Thanks in advance!

The difficult part is ordering the headers when some of them appear only in one file. The best way I know is to build a directed graph using the Graph module and sort the elements topologically.
Once that's done, it's simply a matter of assigning the values from each file to the correct columns and filling the blanks with 5s.
I've incorporated the headers as the first line of each data file, so this program works with this data:
file1.txt
ID Affy1 Affy3 Affy4 Affy5
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
file2.txt
ID Affy1 Affy2 Affy3
B1 1 2 0
B2 0 1 1
B3 5 1 1
And here's the code:
consolidate_columns.pl
use strict;
use warnings 'all';

use Graph::Directed;

my @files = qw/ file1.txt file2.txt /;

# Make an array of two file handles
#
my @fh = map {
    open my $fh, '<', $_ or die qq{Unable to open "$_" for input: $!};
    $fh;
} @files;

# Make an array of two lists of header names
#
my @file_heads = map { [ split ' ', <$_> ] } @fh;

# Use a directed graph to sort all of the header names so that they're
# still in the order that they were at the top of both files
#
my @ordered_headers = do {
    my $g = Graph::Directed->new;
    for my $f ( 0, 1 ) {
        my $file_heads = $file_heads[$f];
        $g->add_edge($file_heads->[$_], $file_heads->[$_+1]) for 0 .. $#$file_heads-1;
    }
    $g->topological_sort;
};

# Form a hash converting header names to column indexes for output
#
my %ordered_headers = map { $ordered_headers[$_] => $_ } 0 .. $#ordered_headers;

# Print the header and the reformed records from each file. Use the hash to
# convert the header names into column indexes
#
print "@ordered_headers\n";

for my $i ( 0 .. $#fh ) {

    my $fh         = $fh[$i];
    my @file_heads = @{ $file_heads[$i] };
    my @splice     = map { $ordered_headers{$_} } @file_heads;

    while ( <$fh> ) {
        next unless /\S/;
        my @columns;
        @columns[@splice] = split;
        $_ //= 5 for @columns[0 .. $#ordered_headers];
        print "@columns\n";
    }
}
output
ID Affy1 Affy2 Affy3 Affy4 Affy5
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
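For comparison, the header-ordering step can be sketched in Python with the standard-library graphlib module (a rough equivalent of the Graph::Directed approach; the header lists are taken from the example files):

```python
from graphlib import TopologicalSorter

file1_heads = ["Affy1", "Affy3", "Affy4", "Affy5"]
file2_heads = ["Affy1", "Affy2", "Affy3"]

# Add an edge for each adjacent pair of headers, so the merged
# order respects the order seen in both files
ts = TopologicalSorter()
for heads in (file1_heads, file2_heads):
    for prev, nxt in zip(heads, heads[1:]):
        ts.add(nxt, prev)   # nxt may only appear after prev

order = list(ts.static_order())
print(order)   # ['Affy1', 'Affy2', 'Affy3', 'Affy4', 'Affy5']
```

As in the Perl answer, a unique topological order only exists when the two files' header orders are mutually consistent.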

For the fun of it -- HTH
#!/usr/bin/perl
use warnings;
use strict;

use constant {A => 1, B => 2, BOTH => 3};

# I don't read the data from files here
my @columns   = qw(Affy1 Affy2 Affy3 Affy4 Affy5);
my @locations = (BOTH, B, BOTH, A, A);

my @contentA = (["A1", 1, 2, 5, 1],
                ["A2", 0, 2, 1, 1],
                ["A3", 1, 0, 2, 2]);
my @contentB = (["B1", 1, 2, 0],
                ["B2", 0, 1, 1],
                ["B3", 5, 1, 1]);

# I assume both files have the same number of lines
my @ares = ();
my @bres = ();
for (my $i = 0; $i < @contentA; ++$i) {
    # This uses a lot of memory with huge amounts of data.
    # Maybe write the results to two temporary files and cat them
    # together at the end; another alternative would be to iterate
    # first over file A and then over file B.
    my @row_a = ();
    my @row_b = ();
    push @row_a, shift @{$contentA[$i]};    # id
    push @row_b, shift @{$contentB[$i]};    # id
    foreach my $loc (@locations) {
        if (A == $loc) {
            push @row_a, shift @{$contentA[$i]};
            push @row_b, 5;
        }
        if (B == $loc) {
            push @row_a, 5;
            push @row_b, shift @{$contentB[$i]};
        }
        if (BOTH == $loc) {
            push @row_a, shift @{$contentA[$i]};
            push @row_b, shift @{$contentB[$i]};
        }
    }
    push @ares, \@row_a;
    push @bres, \@row_b;
}

foreach my $ar (@ares) {
    print join " ", @{$ar};
    print "\n";
}
foreach my $br (@bres) {
    print join " ", @{$br};
    print "\n";
}
print join("\n", @columns);
print "\n";
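The same fill-with-5 re-slotting can be sketched in Python, given the final header order (names and sample rows here mirror the example data; this is an illustration, not the original program):

```python
ordered = ["Affy1", "Affy2", "Affy3", "Affy4", "Affy5"]
col_of = {h: i for i, h in enumerate(ordered)}   # header name -> output column

def merge_rows(file_heads, rows):
    """Re-slot each row's values into the merged column order, defaulting to 5."""
    out = []
    for row in rows:
        cols = [5] * len(ordered)                # missing values become 5
        for head, val in zip(file_heads, row[1:]):
            cols[col_of[head]] = val
        out.append([row[0]] + cols)
    return out

rows1 = [["A1", 1, 2, 5, 1], ["A2", 0, 2, 1, 1], ["A3", 1, 0, 2, 2]]
for row in merge_rows(["Affy1", "Affy3", "Affy4", "Affy5"], rows1):
    print(*row)   # A1 1 5 2 5 1, etc.
```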

Related

TCL list data to histogram

I'm doing some data analysis, and the output is a long list of numbers. Each line consists of 1 to n numbers, which may be duplicated:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 4
I'd like to put these into a (time-series) histogram. I'm not an expert in tcl (yet?), and I have some ideas how to do this but I have not been successful yet. The puts statements are just so I can see what's happening.
while { [gets $infile line] != -1 } {
    set m [llength $line]
    puts "line length $m"
    foreach item $line {
        puts $item
        incr nc($item)
        puts "nc: $nc($item)"
    }
}
This nc array I've created gives me a size-based (one-dimensional) array. However, I'd like a per-line (2D) array; naively it would be nc($item)($nlines). I initially tried labeling the array variable with the length, such as nc${item}($nlines), but I am not smart enough to get that to work.
I appreciate any help.
Best
Mike
Although Tcl arrays are one-dimensional, you can construct key strings to fake multi-dimensionality:
set lineno -1
set fh [open infile r]
while {[gets $fh line] != -1} {
    incr lineno
    foreach item [split [string trim $line]] {
        incr nc($lineno,$item)
    }
}
close $fh

# `parray` is a handy command for inspecting arrays
parray nc
outputs
nc(0,1) = 20
nc(0,2) = 8
nc(0,3) = 2
nc(0,4) = 1
nc(1,1) = 2
nc(1,2) = 4
nc(1,4) = 3
nc(2,1) = 1
nc(2,2) = 1
nc(2,3) = 1
nc(2,4) = 1
Or use dictionaries:
set lineno -1
set nc {}
set fh [open infile r]
while {[gets $fh line] != -1} {
    set thisLine {}
    foreach item [split [string trim $line]] {
        dict incr thisLine $item
    }
    dict set nc [incr lineno] $thisLine
}
close $fh

dict for {line data} $nc {
    puts [list $line $data]
}
outputs
0 {1 20 2 8 3 2 4 1}
1 {1 2 2 4 4 3}
2 {1 1 2 1 3 1 4 1}
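The composite-key/dictionary idea maps directly onto other languages, too; for instance, a Python sketch of the per-line histogram (variable names are illustrative):

```python
from collections import Counter

lines = ["1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 4"]

# one Counter per line: nc[lineno][item] -> number of occurrences
nc = {lineno: Counter(line.split()) for lineno, line in enumerate(lines)}
print(nc[0])   # Counter({'1': 20, '2': 8, '3': 2, '4': 1})
```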

How to move inside an array transformed into a data frame?

I have written the following code to declare an array as a data frame:
b=as.data.frame(array(0,dim=c(NF,29,1,T+1),
dimnames=list(NULL,c(…..varnames))))
Now, I am not able to move inside the array. For instance, if I need to show all the matrices at array position [,,1,1], what do I need to write?
I have tried code like:
b$[].1.1
b$,1.1
b[,,1,1]
but, of course, it does not work.
Thank you very much for your help!
from ?as.data.frame :
Arrays can be converted to data frames. One-dimensional arrays are
treated like vectors and two-dimensional arrays like matrices. Arrays
with more than two dimensions are converted to matrices by
‘flattening’ all dimensions after the first and creating suitable
column labels.
array1 <- array(1:8,dim = c(2,2,2),dimnames = split(paste0(rep(letters[1:2],each=3),1:3),1:3))
# , , 3 = a3
#
# 2
# 1 a2 b2
# a1 1 3
# b1 2 4
#
# , , 3 = b3
#
# 2
# 1 a2 b2
# a1 5 7
# b1 6 8
#
df1 <- as.data.frame(array1)
# a2.a3 b2.a3 a2.b3 b2.b3
# a1 1 3 5 7
# b1 2 4 6 8
df1$b2.a3
# [1] 3 4
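The same flattening can be reproduced in Python with NumPy, using Fortran (column-major) order to match R's storage (a sketch of the 2x2x2 example above):

```python
import numpy as np

# R's array(1:8, dim = c(2,2,2)) fills column-major, hence order="F"
arr = np.arange(1, 9).reshape((2, 2, 2), order="F")

# as.data.frame flattens every dimension after the first
flat = arr.reshape((2, -1), order="F")
print(flat)   # [[1 3 5 7]
              #  [2 4 6 8]]
```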
I need to create a data frame, starting from an array whose dimensions are (2,3,1,3):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
Hence, the output that I need is:
debt loan stock debt loan stock debt loan stock
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
Is the next code correct?
b=array(0, dim=c(3,3,1,4), dimnames=list(NULL,c("debt","loan","stock")))
output=as.data.frame(b)

How can I create two other matrices from a single mx3 matrix?

I have an mx3 matrix A containing both integer and non-integers.
A = [1.5 1 1
1 1.5 1
2 1.5 1
1.5 2 1
1 1 1.5
2 1 1.5
1 2 1.5
2 2 1.5
1.5 1 2
1 1.5 2
2 1.5 2
1.5 2 2];
What I would want is to create 2 new matrices A1 and A2 such that I scan through each row of A and:
A1 = subtract 0.5 from any non-integer found in any column, and leave the integers as they are.
A2 = add 0.5 to any non-integer found in any column, and leave the integers as they are.
I would expect my final arrays to be:
A1 = [1 1 1
1 1 1
2 1 1
1 2 1
1 1 1
2 1 1
1 2 1
2 2 1
1 1 2
1 1 2
2 1 2
1 2 2];
A2 = [2 1 1
1 2 1
2 2 1
2 2 1
1 1 2
2 1 2
1 2 2
2 2 2
2 1 2
1 2 2
2 2 2
2 2 2];
If your "non-integer" numbers are only x.5, you can simply use floor and ceil:
A1 = floor(A);
A2 = ceil(A);
If that's not the case, use logical indexing:
A1 = A;
A1(round(A1) ~= A1) = A1(round(A1) ~= A1) - 0.5;
A2 = A;
A2(round(A2) ~= A2) = A2(round(A2) ~= A2) + 0.5;
You can also make a condition, and depending on how you satisfy that condition either add or subtract 0.5:
cond = (rem(A,1) ~= 0); % Generates a logical matrix
A1 = A; A2 = A;
%subtract and add 0.5 only to the elements which satisfy the condition:
A1(cond) = A1(cond) -0.5;
A2(cond) = A2(cond) +0.5;
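The same condition-based approach translates to Python/NumPy, where a boolean mask plays the role of MATLAB's logical index (a sketch using the first few rows of A):

```python
import numpy as np

A = np.array([[1.5, 1.0, 1.0],
              [1.0, 1.5, 1.0],
              [2.0, 1.5, 1.0]])

cond = A % 1 != 0                 # True where the entry is non-integer
A1 = np.where(cond, A - 0.5, A)   # subtract 0.5 from non-integers only
A2 = np.where(cond, A + 0.5, A)   # add 0.5 to non-integers only
print(A1)
print(A2)
```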

Fetching indices of a text file from another text file

The title may not be so descriptive. Let me explain:
I have a file (Say File 1) having some numbers [delimited by a space]. see here,
1 2 3 4 5
1 2 8 4 5 6 7
1 9 3 4 5 6 7 8
..... n lines (length of each line varies).
I have another file (Say File 2) having some numbers [delimited by a tab]. see here,
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
..... m lines (length of each line fixed).
I want the sum of the 1 2 3 4 5 th positions (file 1, line 1) of file 2, line 1.
I want the sum of the 1 2 3 4 5 6 7 th positions (file 1, line 2) of file 2, line 1, and so on.
That is, I want the line-wise sums of file 2 over the positions given by all lines of file 1.
It will look like:
5 6 6 …n columns (File 1)
1 8 3
9 8 4
… m rows (File 2)
I did this by the following code:
open( FH1, "File1.txt" );
@index = <FH1>;
open( FH2, "File2.txt" );
@matrix = <FH2>;
open( OUTPUT, ">sum.txt" );
foreach $xx (@matrix) {
    @k1 = split( /\t/, "$xx" );
    foreach $yy (@index) {
        @k2 = split( / /, "$yy" );
        $ssum = 0;
        foreach $zz (@k2) {
            $zz1 = $zz - 1;
            if ( $k1[$zz1] == 1 ) {
                $ssum++;
            }
        }
        printf OUTPUT "$ssum\t";
        $ssum = 0;
    }
    print OUTPUT "\n";
}
close FH1;
close FH2;
close OUTPUT;
It works absolutely fine, except that the time requirement is enormous for large files (e.g. a 1000-line File 1 against a 25000-line File 2 takes 8 minutes).
My data may exceed 4 times this example, and that's unacceptable for my users.
How can I accomplish this in much less time, or with another approach?
Always include use strict; and use warnings; in every Perl script.
You can simplify your script by not processing the first file multiple times. Also, your coding style is very outdated; you could use some lessons from the Modern Perl book by chromatic.
The following is your script, simplified to take advantage of more modern style and techniques. Note that it currently loads the file data from inside the script instead of from external sources:
use strict;
use warnings;
use autodie;

use List::Util qw(sum);

my @indexes = do {
    #open my $fh, '<', "File1.txt";
    open my $fh, '<', \ "1 2 3 4 5\n1 2 8 4 5 6 7\n1 9 3 4 5 6 7 8\n";
    map { [ map { $_ - 1 } split ' ' ] } <$fh>;
};

#open my $infh, '<', "File2.txt";
my $infh = \*DATA;

#open my $outfh, '>', "sum.txt";
my $outfh = \*STDOUT;

while (<$infh>) {
    my @vals = split ' ';
    print $outfh join( ' ', map { sum( @vals[@$_] ) } @indexes ), "\n";
}

__DATA__
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
Outputs:
5 6 7
5 7 8
5 6 7
5 6 7
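The key speed-up — parsing File 1 once into zero-based index lists, then summing slices of each File 2 line — can be sketched in Python as well (example data inline; file handling omitted):

```python
index_lines = ["1 2 3 4 5", "1 2 8 4 5 6 7", "1 9 3 4 5 6 7 8"]

# parse File 1 once, converting 1-based positions to 0-based indices
indexes = [[int(n) - 1 for n in line.split()] for line in index_lines]

row = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # one line of File 2
sums = [sum(row[i] for i in idx) for idx in indexes]
print(*sums)   # 5 6 7
```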

Matlab: Update elements in array by identifier

I have a large array A with ~500,000 rows of the form
[ id1 id2 value1 value2 zero zero ]
and another, smaller array B (~20,000 rows) containing some of the identifiers from A:
[ id1 id2 value3 value4 ]
All the pairs of IDs in B exist in A. I want to write the values of B into A at the positions where both id1 and id2 match. The (row-)order of the new array may be arbitrary.
An example:
A = 1 1 3 5 0 0
1 2 6 4 0 0
1 3 3 1 0 0
2 1 3 8 0 0
3 4 0 2 0 0
B = 2 1 7 4
1 1 2 1
should yield
C = 1 1 3 5 2 1
1 2 6 4 0 0
1 3 3 1 0 0
2 1 3 8 7 4
3 4 0 2 0 0
Iterating through A for each element in B with for loops takes incredibly long. I hope there is a faster way.
You can use ismember to obtain the indices of the rows where "id1" and "id2" match, and then update the last two columns with the corresponding values from B:
C = A;
[tf, loc] = ismember(B(:, 1:2), A(:, 1:2), 'rows');
C(loc, 5:6) = B(:, 3:4);
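For comparison, the same update-by-identifier can be done in Python by hashing the ID pairs, which mirrors ismember's row lookup (a sketch with the example matrices):

```python
import numpy as np

A = np.array([[1, 1, 3, 5, 0, 0],
              [1, 2, 6, 4, 0, 0],
              [1, 3, 3, 1, 0, 0],
              [2, 1, 3, 8, 0, 0],
              [3, 4, 0, 2, 0, 0]])
B = np.array([[2, 1, 7, 4],
              [1, 1, 2, 1]])

C = A.copy()
# map each (id1, id2) pair to its row index in A -- one pass, O(1) lookups
pos = {tuple(ids): i for i, ids in enumerate(A[:, :2].tolist())}
for row in B:
    C[pos[tuple(row[:2].tolist())], 4:6] = row[2:4]
print(C)
```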
