How can I find the common items from two arrays fast? - arrays

I am trying to find the common lines between two tab-separated files based on one field.
One line of the first file:
1 52854 s64199.1 A . . . PR GT 0/0
One line of the second file:
chr1 52854 . C T 215.302 . AB=0.692308;ABP=7.18621;AC=1;AF=0.5;AN=2;AO=9;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=3.25157;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=17.5429;PAIRED=0;PAIREDR=0.25;PAO=0;PQA=0;PQR=0;PRO=0;QA=318;QR=138;RO=4;RPP=3.25157;RPPR=5.18177;RUN=1;SAF=0;SAP=22.5536;SAR=9;SRF=1;SRP=5.18177;SRR=3;TYPE=snp;technology.illumina=1;BVAR GT:DP:RO:QR:AO:QA:GL 0/1:13:4:138:9:318:-5,0,-5
Based on the second field (52854), in this example I have a match.
Here is my code, which finds the common ones, but my files are quite large and it takes a lot of time.
Is there any way to speed up the process?
Thank you very much in advance.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
open my $map_info, $map_file or die "Could not open $map_file: $!";
my @map_array = ();
my @vcf_array = ();
while( my $mline = <$map_info>) {
chomp $mline;
my @data1 = split('\t', $mline);
my $pos1 = $data1[1];
push (@map_array, $pos1);
}
open my $vcf_info, $vcf_file or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
if ($line !~ m/^#/) {
push (@vcf_array, $line);
}
}
foreach my $a (@map_array) {
chomp $a;
foreach my $b (@vcf_array) {
chomp $b;
my @data = split('\t', $b);
my $pos2 = $data[1];
my $ref2 = $data[3];
my $allele = $data[4];
my $genotype = $data[9];
if ($a == $pos2) {
print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
#print "$b\n";
}
}
}

Here's a version that should run much faster than your own.
It reads the map file and stores each pos field as a key in the hash %wanted. Then it reads through the second file and checks whether each record's position is among the wanted values. If so, it splits the record and prints the fields you require.
Note that I haven't been able to test this beyond making sure that it compiles.
use strict;
use warnings;
use 5.010;
use autodie;
my ( $map_file, $vcf_file ) = @ARGV;
my %wanted;
{
open my $map_fh, '<', $map_file;
while ( <$map_fh> ) {
chomp;
my $pos = ( split /\t/, $_, 3 )[1];
++$wanted{$pos};
}
}
{
open my $vcf_fh, '<', $vcf_file;
while ( <$vcf_fh> ) {
next if /^#/;
chomp;
my $pos = ( split /\t/, $_, 3 )[1];
next unless $wanted{$pos};
my ( $ref, $allele, $genotype ) = ( split /\t/ )[3, 4, 9];
print join("\t", $pos, $ref, $allele, $genotype), "\n";
}
}
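The lookup idea can be checked on a few in-memory records before running it on real files. The sketch below is only an illustration: the sub name matching_records and the sample lines are made up for the example, and the VCF lines are shortened to ten fields.

```perl
use strict;
use warnings;

# Hypothetical helper: return "pos\tref\talt\tgenotype" for every VCF line
# whose second field also occurs as the second field of a map line.
sub matching_records {
    my ( $map_lines, $vcf_lines ) = @_;

    # Build the lookup set once: one hash key per position in the map data.
    my %wanted;
    for my $line (@$map_lines) {
        my $pos = ( split /\t/, $line, 3 )[1];
        $wanted{$pos} = 1;
    }

    my @out;
    for my $line (@$vcf_lines) {
        next if $line =~ /^#/;            # skip VCF header lines
        my @f = split /\t/, $line;
        next unless $wanted{ $f[1] };     # single hash lookup per record
        push @out, join "\t", @f[ 1, 3, 4, 9 ];
    }
    return @out;
}

# Sample data (invented, modelled on the lines in the question).
my @map = ( join( "\t", qw(1 52854 s64199.1 A . . . PR GT 0/0) ) );
my @vcf = (
    "##fileformat=VCFv4.1",
    join( "\t", qw(chr1 52854 . C T 215.302 . TYPE=snp GT:DP 0/1:13) ),
    join( "\t", qw(chr1 60000 . G A 99.0 . TYPE=snp GT:DP 0/0:20) ),
);

print "$_\n" for matching_records( \@map, \@vcf );
```

With this sample data only the record at position 52854 is printed. Each VCF record costs one hash lookup, so the work grows with the sum of the two file sizes rather than their product.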

Below is a minimal modification of your script that uses hash-based lookups:
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
my %vcf_hash;
open( my $vcf_info, $vcf_file) or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
next if $line =~ m/^#/; # Skip comment lines
chomp $line;
my (@data) = split(/\t/, $line);
die unless @data >= 10; # Check number of fields in the input line
my ($pos) = $data[1];
# $. - line number in the file
$vcf_hash{$pos}{$.} = \@data;
}
open( my $map_info, $map_file) or die "Could not open $map_file: $!";
while( my $mline = <$map_info>) {
chomp $mline;
my (@data) = split(/\t/, $mline);
die unless @data >= 2; # Check number of fields in the input line
my ($pos) = $data[1];
if( exists $vcf_hash{$pos}) {
my $hash_ref = $vcf_hash{$pos};
for my $n (sort{$a<=>$b} keys %$hash_ref) {
my $array_ref = $hash_ref->{$n};
my $pos2 = $array_ref->[1];
my $ref2 = $array_ref->[3];
my $allele = $array_ref->[4];
my $genotype = $array_ref->[9];
print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
}
}
}
The script may be improved further to reduce memory use if you use huge data files.

There is no need to keep the whole map file in memory, only its keys. Store them as keys of a hash and use that hash for existence checks. You don't need to keep the VCF file in memory either; you can decide line by line whether to print a record.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
use autodie;
use constant KEY => 1;
use constant FIELDS => ( 1, 3, 4, 9 );
my ( $map_file, $vcf_file ) = @ARGV;
my %map;
{
my $fh;
open $fh, '<', $map_file;
while (<$fh>) {
$map{ ( split /\t/, $_, KEY + 2 )[KEY] } = undef;
}
}
{
my $fh;
open $fh, '<', $vcf_file;
while (<$fh>) {
next if /^#/;
chomp;
my @data = split /\t/;
print join( "\t", @data[FIELDS] ), "\n" if exists $map{ $data[KEY] }; # the line was chomped, so add the newline back
}
}

Related

Definition error when using columns in one file to find matching columns in another file with perl

I have a tab delimited input file in the format:
+ Chr1 www
- Chr2 zzz
...
I would like to go line by line against a reference tab delimited file (TRANSCRIPTS in the code below) in the format of:
Chr1 + xxx UsefulInfo1
Chr2 - yyy UsefulInfo2
...
And would like an output that looks like:
+ Chr1 UsefulInfo1
- Chr2 UsefulInfo2
...
Here is my attempt to take variable names from the command line, grab certain info from the input file, and append the useful info from the reference file:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $inFile = $ARGV[0];
my $outFile = $ARGV[1];
open(INFILE, "<$inFile") || die("Couldn't open $inFile: $!\n");
open(OUTFILE, ">$outFile") || die("Couldn't create $outFile: $!\n");
open(TRANSCRIPTS, "</path/TranscriptInfo") || die("Couldn't open reference file!");
my @transcripts = split(/\t+/, <TRANSCRIPTS>);
chomp @transcripts;
#Define desired information from input for later
while (my @columns = split(/\t+/, <INFILE>)) {
chomp @columns;
my $strand = $columns[0];
my $chromosome = $columns[1];
#Attempt to search reference file line by line for matching criteria and copying a column of matching lines
foreach my $reference(@transcripts) {
my $refChr = $reference[0]; #Error for this line
my $refStrand = $reference[1]; #Error for this line
if ($refChr eq $chromosome && $refStrand eq $strand) {
my $info = $reference[3]; #Error for this line
print OUTFILE "$strand\t$chromosome\t\$info\n";
}
}
}
close(OUTFILE); close(INFILE);
At the moment I receive "Global symbol "@reference" requires explicit package name." What is the proper way to define this? I'm not even entirely sure my foreach loop will function as desired once the symbol is defined properly.
Fixed:
use strict;
use warnings;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";
my @transcripts;
{
open(my $transcripts_fh, "<", $transcripts_qfn)
or die("Can't open \"$transcripts_qfn\": $!\n");
while (<$transcripts_fh>) {
chomp;
push @transcripts, [ split(/\t/, $_, -1) ];
}
}
{
open(my $in_fh, "<", $in_qfn)
or die("Can't open \"$in_qfn\": $!\n");
open(my $out_fh, ">", $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
while (<$in_fh>) {
chomp;
my ($strand, $chr) = split(/\t/, $_, -1);
for my $transcript (@transcripts) {
my $ref_chr = $transcript->[0];
my $ref_strand = $transcript->[1];
if ($chr eq $ref_chr && $strand eq $ref_strand) {
my $info = $transcript->[3]; # the fourth column holds the useful info
say $out_fh join("\t", $strand, $chr, $info);
}
}
}
}
That said, the above is very inefficient. Let N be the number of lines in $transcripts_qfn and M the number of lines in $in_qfn. The inner loop body executes N*M times. In fact, it only needs to execute N times, once the (strand, chromosome) pairs from the input file are stored in a hash:
use strict;
use warnings;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";
my %to_print;
{
open(my $in_fh, "<", $in_qfn)
or die("Can't open \"$in_qfn\": $!\n");
while (<$in_fh>) {
chomp;
my ($strand, $chr) = split(/\t/, $_, -1);
++$to_print{$strand}{$chr};
}
}
{
open(my $transcript_fh, "<", $transcripts_qfn)
or die("Can't open \"$transcripts_qfn\": $!\n");
open(my $out_fh, ">", $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
while (<$transcript_fh>) {
chomp;
my ($ref_chr, $ref_strand, undef, $info) = split(/\t/, $_, -1); # the fourth column holds the useful info
next if !$to_print{$ref_strand};
next if !$to_print{$ref_strand}{$ref_chr};
say $out_fh join("\t", $ref_strand, $ref_chr, $info);
}
}
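The two separate "next if" tests in the loop above are worth a note. Reading $to_print{$ref_strand}{$ref_chr} directly would give the same answer, but Perl would then create an empty inner hash for every strand it had not seen (autovivification). Checking the outer key first avoids that. A minimal sketch with made-up keys:

```perl
use strict;
use warnings;

my %to_print = ( '+' => { Chr1 => 1 } );

# Reading a nested element under a missing outer key creates that outer key.
my $hit = $to_print{'-'}{Chr2};
print "after direct read:  ", join( ",", sort keys %to_print ), "\n";

# Guarding on the outer key first leaves the hash untouched.
my %guarded = ( '+' => { Chr1 => 1 } );
my $found = $guarded{'-'} && $guarded{'-'}{Chr2};
print "after guarded read: ", join( ",", sort keys %guarded ), "\n";
```

The first hash ends up with both '+' and '-' as keys, the second still has only '+'. For a lookup table that is read many times, the guarded form keeps the table from growing with keys that were never inserted on purpose.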

Perl retrieve index of array on regex match and print

I am looking to extract columns based on header names in a comma (or tab) delimited file. I have a scalar variable named '$Account_Name', among others, that matches many possible header spellings. I want to read the file's column headers, match them against '$Account_Name', and print the matched column along with its data.
Here is my code:
open(FILE, "list_2.txt") or die "Cannot open file: $!";
my $Account_Name = qr/^Acct ID$|^Account No$|^Account$|^ACCOUNT NUMBER$|Account Number|Account.*?Number|^Account$|^Account #$|^Account_ID$|^Account ID$/i;
my $CLIENT = qr/^CLIENT_NAME$|^Account Long Name$|^ACCOUNT NAME$|^Account Name$|^Name$|portfolio.*?description|^Account Description$/i;
while (my $line = <FILE>) {
chomp $line;
my @array = split(/,/, $line);
my %index;
@index{@array} = (0..$#array);
my $Account_Name_ = $index{$Account_Name};
if (my ($matched) = grep $array[$_] =~ /$Account_Name/, 0..$#array) {
$Account_Name_ = $matched;
my $CLIENT_ = $index{$CLIENT};
if (my ($matched) = grep $array[$_] =~ /$CLIENT/, 0..$#array) {
$CLIENT_ = $matched;
print $array[$Account_Name_],",",$array[$CLIENT_],"\n";
}
}
}
close(FILE);
Data, list_2.txt
Account number,order_num,Name
dj870-1234,12334566,josh trust 1992
My Results
Account number,Name
Desired Output
Account number,Name
dj870-1234,josh
For some reason I am only able to print the column names based on the match. How can I grab the data as well?
You need to move your print statement so that it also runs for your data lines. Those lines do not match the header patterns, so in the original code the print statement is never reached for them!
use warnings;
open(FILE, "list_2.txt") or die "Cannot open file: $!";
my $Account_Name = qr/^Acct ID$|^Account No$|^Account$|^ACCOUNT NUMBER$|Account Number|Account.*?Number|^Account$|^Account #$|^Account_ID$|^Account ID$/i;
my $CLIENT = qr/^CLIENT_NAME$|^Account Long Name$|^ACCOUNT NAME$|^Account Name$|^Name$|portfolio.*?description|^Account Description$/i;
my ($Account_Name_, $CLIENT_);
while (my $line = <FILE>) {
chomp $line;
my @array = split(/,/, $line);
if (my ($matched) = grep $array[$_] =~ /$Account_Name/, 0..$#array) {
$Account_Name_ = $matched;
if (my ($matched) = grep $array[$_] =~ /$CLIENT/, 0..$#array) {
$CLIENT_ = $matched;
}
}
print $array[$Account_Name_],",",$array[$CLIENT_],"\n";
}
close(FILE);
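An alternative is to look up the column positions once, from the header row, and then reuse them for every data row. The sketch below assumes the first line of the input is the header and that fields contain no embedded commas; the sub name column_index and the shortened patterns are illustrative, not from the original post.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical helper: index of the first header matching $pattern, or undef.
sub column_index {
    my ( $pattern, @headers ) = @_;
    return first { $headers[$_] =~ $pattern } 0 .. $#headers;
}

my $account_re = qr/^Account(?: number| No| ID)?$/i;    # shortened for the example
my $client_re  = qr/^(?:Account )?Name$/i;

# In-memory stand-in for the lines of list_2.txt.
my @rows = ( "Account number,order_num,Name", "dj870-1234,12334566,josh trust 1992" );

my @header = split /,/, $rows[0];
my $acct   = column_index( $account_re, @header );
my $client = column_index( $client_re,  @header );
die "required columns not found\n" unless defined $acct && defined $client;

# Every row, header included, is printed using the same two indexes.
for my $row (@rows) {
    my @fields = split /,/, $row;
    print join( ",", @fields[ $acct, $client ] ), "\n";
}
```

Note the defined check: a match in the first column yields index 0, which is false in Perl, so testing the index for truth would wrongly treat it as "not found".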

To remove duplicate elements from an array in Perl

I have a data set
10-101570715-101609901-hsa-mir-3158-1 10-101600739-101609661-ENSG00000166171 10-101588288-101609668-ENSG00000166171 10-101588325-101609447-ENSG00000166171 10-101594702-101609439-ENSG00000166171 10-101570560-101596651-ENSG00000166171
10-103389007-103396515-hsa-mir-1307 10-103389041-103396023-ENSG00000173915 10-103389050-103396074-ENSG00000173915 10-103389050-103396441-ENSG00000173915 10-103389050-103396466-ENSG00000173915 10-103389050-103396466-ENSG00000173915
Except for the first element in each line, I have multiple values, some of which are redundant, and I want to remove the redundant values. I have written some code, but I don't think it's working correctly.
open (fh, "file1");
while ($line=<fh>)
{
chomp ($line);
@array=$line;
my @unique = ();
my %Seen = ();
foreach my $elem ( @array )
{
next if $Seen{ $elem }++;
push @unique, $elem;
}
print @unique;
}
A hash is the usual tool for duplicate detection:
my %seen;
my @removeduplicate = grep { !$seen{$_}++ } @array;
The code below works fine for me:
use strict;
use warnings;
open my $fh, "<", 'file.txt' or die "couldn't open : $!";
while ( my $line = <$fh>)
{
chomp $line;
my %seen; # reset for each line so duplicates are removed per line
my @array = split (' ', $line);
my @removeduplicate = grep { !$seen{$_}++ } @array;
print "@removeduplicate\n";
}

Compare two hashes in perl and list which records are extra?

I have two text files that contain user records. I have to compare these two files, figure out which users are missing from File1, and delete these orphans from File2.
#!/usr/local/bin/perl -w
use strict;
use warnings;
use autodie;
use Text::Diff;
use List::Compare;
use Data::Dumper;
my $Users1 = "Users1.txt";
my $Users2 ="Users2.txt";
my %hash1;
my %hash2;
my %new_hash;
my @sorted_1;
my @sorted_2;
my @list_keys1;
my @list_keys2;
open(my $fh1, '<:encoding(UTF-8)', $Users1) or die "Colud not open the file!";
while(my $record1 = <$fh1>)
{
chomp $record1;
my @list1 = split( '/', $record1);
foreach my $item(@list1)
{
$new_hash{$list1[1]} = $list1[0];
$hash1{$list1[1]} = $list1[0];
}
while ( my ($key, $value) = each(%hash1) ) {
push (@list_keys1, $key);
@sorted_1 = sort @list_keys1;
}
}
print "\t\tHash values for USERS1:\n";
print Dumper \%hash1;
open(my $fh2, '<:encoding(UTF-8)', $Users2) or die "Colud not open the file!";
while(my $record2 = <$fh2>)
{
chomp $record2;
my @list2 = split( '/', $record2);
foreach my $item(@list2)
{
$hash2{$list2[1]} = $list2[0];
}
while ( my ($key, $value) = each(%hash2) )
{
push (@list_keys2, $key);
@sorted_2 = sort @list_keys2;
}
}
print "\n\n\t\tHash values for Users2:\n";
print Dumper \%hash2;
@hash1{@list_keys1} = 1;
@hash2{@list_keys2} = 1;
foreach(keys %hash2)
{
print "\nThis user does not exist(to be deleted): $_\n" unless exists $hash1{$_};
}
foreach (keys %hash1)
{
print "\nNew User (to be added):$_\n" unless exists $hash2{$_};
}
close ($fh1);
close ($fh2);
Questions:
I am not able to sort the user IDs (strings) alphabetically (here, user IDs are random strings of length 7). Are there any limitations when it comes to sorting arrays/hashes in Perl?
I am not able to compare two hashes and get the differences. What would be the most efficient way to do that?
Are there any additional libraries that I need to install in order to handle this part of code?
Sample records from file:
File1:
ASIA/ASEDF46
INDIA/PSDfT5V
CHINA/FSDfT5V
INDIA/AA44TYB
USA/BBRTT67
File 2:
INDIA/PSDfT5V
CHINA/FSDfT5V
INDIA/AA44TYB
USA/BBRTT67
UK/ZK9EELO
use strict;
use warnings;
use autodie;
open my $in, '<', 'in.txt';
open my $in2, '<', 'in_2.txt';
my (%data1, %data2);
while(<$in>){
chomp;
my @split = split/\//;
$data1{$split[0]} = $split[1];
}
while(<$in2>){
chomp;
my @split = split/\//;
$data2{$split[0]} = $split[1];
}
foreach(sort keys %data1){
print "User: $_ Value: $data1{$_}\n" if $data2{$_};
}
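For the original question (which user IDs appear in one file but not the other), the IDs themselves can serve as hash keys, and the difference is a grep over one hash's keys. The following is a sketch under the assumption that each record is REGION/USERID and that user IDs are unique; the sub names are made up for illustration, and the sample records are the ones from the question, held in memory instead of read from files.

```perl
use strict;
use warnings;

# Hypothetical helper: map user ID => region from "REGION/USERID" records.
sub index_by_user {
    my %by_user;
    for my $record (@_) {
        my ( $region, $user ) = split m{/}, $record, 2;
        $by_user{$user} = $region;
    }
    return \%by_user;
}

# Keys of the first hash that are absent from the second, sorted as strings.
# Intended to be called in list context.
sub only_in_first {
    my ( $first, $second ) = @_;
    return sort grep { !exists $second->{$_} } keys %$first;
}

my $users1 = index_by_user(qw(ASIA/ASEDF46 INDIA/PSDfT5V CHINA/FSDfT5V INDIA/AA44TYB USA/BBRTT67));
my $users2 = index_by_user(qw(INDIA/PSDfT5V CHINA/FSDfT5V INDIA/AA44TYB USA/BBRTT67 UK/ZK9EELO));

print "Orphan in file 2 (to be deleted): $_\n" for only_in_first( $users2, $users1 );
print "Missing from file 2 (to be added): $_\n" for only_in_first( $users1, $users2 );
```

No extra modules are needed for this: the plain sort builtin orders strings alphabetically (by character code, so uppercase sorts before lowercase), and exists on a hash gives the membership test.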

How do I initialize HoH from arrays of variable size

I need help figuring out how to accommodate situations in which $hash{$i} is loaded from @headers arrays of different sizes.
use strict;
use warnings;
my $file = "list.csv";
open (FILE,"$file") || die "Can't open file: $!\n";
my (@lines) = <FILE>;
close(FILE);
my @headers = split(',',$lines[0]); # split up header line
my %hash;
for (my $i=1; $i < scalar(@lines); $i++)
{
my @strings = split(',',$lines[$i]);
# NEED help here
$hash{$i} = {
$headers[0] => $strings[0],
$headers[1] => $strings[0],
$headers[2] => $strings[0],
$headers[3] => $strings[0],
$headers[4] => $strings[0],
$headers[5] => $strings[0]
};
}
Is there a way to load the hash at index $i when scalar(@headers) = 5, 6, 7, etc.? Is there a programmatic equivalent of something like...
$hash{$i} = {
$headers[0] => $strings[0],
...
$headers[n] => $strings[n]
};
or
$hash{$i} = {@headers => @strings};
The idiom you want is:
@{ $hash{$i} }{ @headers } = @strings;
This is known as a hash slice.
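As a small self-contained illustration of the hash-slice assignment (the header names and field values below are invented for the example):

```perl
use strict;
use warnings;

my @headers = qw(name city zip);
my @strings = ( 'Ann', 'Oslo', '0150' );

my %hash;
my $i = 1;

# Hash slice on the (autovivified) inner hash: pairs each header with the
# value at the same position, however many headers there are.
@{ $hash{$i} }{@headers} = @strings;

print "$_ => $hash{$i}{$_}\n" for sort keys %{ $hash{$i} };
```

If a row has fewer values than there are headers, the remaining keys are still created, with undef as their value.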
Given that you're reading CSV data you might look at some of the CPAN modules out there for this sort of thing, Text::CSV, for example.
TIMTOWTDI
#!/usr/bin/perl
use strict;
use warnings;
my $file = "list.csv";
# Use lexical filehandles, not globals; use 3-arg open; don't quote filename
open ( my $fh, '<', $file ) or die "Can't open file: $!\n";
chomp( my @lines = <$fh> ); # chomp so the last field of each line carries no newline
close( $fh );
# split takes a regex; also, notice the shift
my @headers = split( /,/, shift @lines );
my %hash;
# Use perly for loops here
foreach my $i ( 0..$#lines )
# This works, too
#for my $i ( 0..$#lines )
{
# split takes a regex
my @strings = split( /,/, $lines[$i] );
# One way (probably best)
@{ $hash{$i} }{ @headers } = @strings;
# Another way
#$hash{$i} = { map { $headers[$_] => $strings[$_] } ( 0 .. $#strings ) };
# Another way
#$hash{$i}{ $headers[$_] } = $strings[$_] for 0 .. $#strings;
}
#use Data::Dumper;
#print Dumper \%hash;
But yes, using Text::CSV (or the faster Text::CSV_XS) would be even better than trying to manually split the CSV yourself (what happens if there are spaces? what happens if the fields and/or headers are quoted? It's a solved problem.)
