Match a list of identifiers to another list using Perl arrays

I am stuck on how to loop one list over another. Perhaps I am not searching with the right words, but I am stuck and would appreciate any help with my code.
I went over the thread "How do I search a Perl array for a matching string?" but am still getting errors when running my script.
the database file
chr1 1692239 1692249 AH_GARP2_comp198_c0_seq1
chr1 2233934 2233944 CS_GARP2_comp323_c0_seq1
chr1 5993434 5993444 CS_GARP2_comp635_c0_seq1
chr1 6198157 6198167 CS_GARP2_comp115_c0_seq1
chr1 6465781 6465791 JB_GARP2_comp560_c0_seq1
chr1 7827923 7827933 JB_GARP2_comp855_c0_seq1
chr1 7920939 7920949 AA_GARP2_comp614_c0_seq1
chr1 7964000 7964010 CS_GARP2_comp717_c0_seq1
chr1 9314857 9314867 AH_GARP2_comp237_c0_seq1
chr1 9654532 9654542 AH_GARP2_comp632_c0_seq1
the query file
name1 CS_GARP2_comp635_c0_seq1
name2 JB_GARP2_comp855_c0_seq1
name3 AH_GARP2_comp198_c0_seq1
name4 AH_GARP2_comp237_c0_seq1
My code
#!/usr/bin/perl5.16.2
use 5.16.2;
use lib '/users/ec1/perl5/lib/perl5/';
use warnings;
use strict;
my $filename = shift; ## database
my $filename2= shift; ## list of ids
open (DB, '<', $filename ) || die "Unable to open: $!";
open (I , '<', $filename2) || die "Unable to open: $!";
my @DB;
while (<DB>) {
    chomp;
    my @DB = split /\t/, $_; ## define as list
    #print "@DB[0,1,2,3]\n";
}
while (my $line = <I>) {
    chomp $line;
    my ($name, $id) = split /\t/, $line;
    if ($DB[3] =~ /$id/) {
        print "$DB[0]\t$DB[1]\t$DB[2]\t$DB[3]\n";
    } else {
        print "na\n";
    }
}

Put the DB into a hash (%DB) to make the searches easy.
use strict; use warnings;
my $filename = shift; ## database
my $filename2= shift; ## list of ids
my %DB;
open (DB, '<', $filename ) || die "Unable to open: $!";
while (<DB>) {
    chomp;
    my @row = split( /[ \t]+/, $_); ## define as list
    die "expected four items in db file - line $.\n" unless @row == 4; # expect four elements in a row
    die "duplicate id in db file - line $." if exists $DB{$row[3]};
    $DB{$row[3]} = \@row;
}
close DB;
open (I , '<', $filename2) || die "Unable to open: $!";
while (<I>) {
    chomp;
    my ($name, $id) = split /[ \t]+/, $_;
    if (exists($DB{$id})) {
        my @row = @{$DB{$id}};
        print join("\t", @row), "\n";
    } else {
        print "na\n";
    }
}
close(I);
P.S. I changed the split pattern to /[ \t]+/ to make it easier to test with copy-and-pasted data file content.
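Running the script on the sample database and query files above (invocation and file names are just placeholders, not part of the original answer) would print the four matching rows, tab-separated:
perl match_ids.pl database.txt query.txt
chr1    5993434 5993444 CS_GARP2_comp635_c0_seq1
chr1    7827923 7827933 JB_GARP2_comp855_c0_seq1
chr1    1692239 1692249 AH_GARP2_comp198_c0_seq1
chr1    9314857 9314867 AH_GARP2_comp237_c0_seq1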

The problem with your approach is that the my @DB inside the while loop declares a new lexical array on every iteration, scoped to that loop body. It shadows the outer @DB, which is never filled, so by the time the second loop runs $DB[3] is undefined (see the small sketch below).
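Here is a tiny standalone sketch (not from the original answer) of that shadowing behaviour:
use strict;
use warnings;

my @outer;                     # the array the later code would see
for my $line ('row1', 'row2') {
    my @outer = ($line);       # a brand-new inner @outer shadows the outer one...
}                              # ...and is destroyed at the end of every pass
print scalar(@outer), "\n";    # prints 0 - the outer array was never filled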
You should read the id file into a hash map and check whether the id from each db line exists as a key in that hash.
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
my ($dbfile, $idfile) = @ARGV;
open my $id_fh, '<', $idfile;
open my $db_fh, '<', $dbfile;
my %ids;
while (<$id_fh>) {
    ++$ids{$_} for (split /\s+/)[1]; # split and put column 2 as a key in %ids.
}
while (<$db_fh>) {
    my $fld = (split /\s+/)[3]; # split and assign column 4 to $fld
    print "na\n" and next unless $ids{$fld}; # print "na" if $fld is not in the hash and move to the next line
    print "$_"; # print the line if column 4 exists.
}

Related

Convert string into array perl

I have a script that takes the headers of a multi-FASTA file. I want to push them into an array, then loop through that array to find a specific pattern and perform some commands.
open(FH, '<', $ref_seq) or die $!;
while(<FH>){
    $line = $_;
    chomp $line;
    if(m/^>([^\s]+)/){
        $ref_header = $1;
        print "$ref_header\n";
        chomp $header;
        if($1 eq $header){
            $ref_header = $header;
            #print "header is $ref_header\n";
        }
    }
}
This code prints headers like
chr1
chr2
chr3
How can I push these headers into an array?
I tried the following code, but it splits into individual letters instead of $header_array[0] being chr1:
@header_array = split(/\n*/, $ref_header);
print ("Here's the first element $header_array[0]");
Any help will be appreciated.
Shorten the code as shown below, removing some extra statements, and use push. You can combine push and the pattern match:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
my $in_file = shift;
my @headers;
open my $in_fh, '<', $in_file or croak "cannot open $in_file: $!";
while ( <$in_fh> ) {
    push @headers, />(\S+)/;
}
close $in_fh or croak "cannot close $in_file: $!";
print "@headers";
# Now, loop through headers and select the ones you need, for example:
for my $header ( @headers ) {
    if ( $header =~ /foo/ ) {
        # do something
    }
}
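As an aside (not part of the original answer), the push line works because a pattern match in list context returns its captured groups, so a matching line contributes the header and a non-matching line contributes nothing:
my @headers;
my $line = '>chr1 assembled contig';   # a made-up FASTA header line
push @headers, $line =~ />(\S+)/;      # list context: ('chr1') on a match, () otherwise
print "@headers\n";                    # prints: chr1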
A few suggestions on fixing your original code are below:
# Always use strict and use warnings.
# Remove extra parens and make the error message more informative:
open(FH, '<', $ref_seq) or die $!;
while(<FH>){
    $line = $_;
    chomp $line;
    # [^\s] is simply \S:
    if(m/^>([^\s]+)/){
        $ref_header = $1;
        print "$ref_header\n";
        # where is $header coming from?
        chomp $header;
        # if the condition is satisfied, this assignment does not make sense:
        # $ref_header is already the same as $header:
        if($1 eq $header){
            $ref_header = $header;
            #print "header is $ref_header\n";
        }
    }
}
You can use push:
push @header_array, $ref_header;

Removing rows in a dataset matching a value from a separate dataset

I am having some complications with matching strings to each other.
Say I have the following table:
broken
vector
unidentified
synthetic
artificial
And I have a second dataset that looks like this:
org1 Fish
org2 Amphibian
org3 vector
org4 synthetic species
org5 Mammal
I want to remove all the rows from the second table that match the string from the first table so that the output looks like this:
org1 Fish
org2 Amphibian
org5 Mammal
I was thinking of using grep -v in bash, but I am not quite sure how to make it loop through all the strings in table 1.
I am trying to work it out in Perl, but for some reason it returns all my values instead of just the ones that match. Any idea why?
My script looks like this:
#!/bin/perl -w
($br_str, $dataset) = @ARGV;
open($fh, "<", $br_str) || die "Could not open file $br_str/n $!";
while (<$fh>) {
    $str = $_;
    push @strings, $str;
    next;
}
open($fh2, "<", $dataset) || die "Could not open file $dataset $!/n";
while (<$fh2>) {
    chomp;
    @tmp = split /\t/, $_;
    $groups = $tmp[1];
    foreach $str (@strings) {
        if ($str ne $groups) {
            @working_lines = @tmp;
            next;
        }
    }
    print "@working_lines\n";
}
chomp your input and use a hash for your first table:
use warnings;
use strict;
my ( $br_str, $dataset ) = @ARGV;
open(my $fh, "<", $br_str ) || die "Could not open file $br_str: $!";
my %strings;
while (<$fh>) {
    chomp;
    $strings{$_}++;
}
open(my $fh2, "<", $dataset ) || die "Could not open file $dataset: $!";
while (<$fh2>) {
    chomp;
    my @tmp = split /\s+/, $_;
    my $groups = $tmp[1];
    print "$_\n" unless exists $strings{$groups};
}
Note that I used \s+ instead of \t, just to make my copy/paste easier.
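As a quick check (script name is a placeholder, not part of the original answer), running it on the two sample tables should print only the rows whose second column is not listed in the first table:
perl filter_rows.pl table1.txt table2.txt
org1 Fish
org2 Amphibian
org5 Mammal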

Definition error when using columns in one file to find matching columns in another file with perl

I have a tab delimited input file in the format:
+ Chr1 www
- Chr2 zzz
...
I would like to go line by line against a reference tab delimited file (TRANSCRIPTS in the code below) in the format of:
Chr1 + xxx UsefulInfo1
Chr2 - yyy UsefulInfo2
...
And would like an output that looks like:
+ Chr1 UsefulInfo1
- Chr2 UsefulInfo2
...
Here is my attempt to take the file names from the command line, grab certain info from the input file, and append the useful info from the reference file:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $inFile = $ARGV[0];
my $outFile = $ARGV[1];
open(INFILE, "<$inFile") || die("Couldn't open $inFile: $!\n");
open(OUTFILE, ">$outFile") || die("Couldn't create $outFile: $!\n");
open(TRANSCRIPTS, "</path/TranscriptInfo") || die("Couldn't open reference file!");
my @transcripts = split(/\t+/, <TRANSCRIPTS>);
chomp @transcripts;
#Define desired information from input for later
while (my @columns = split(/\t+/, <INFILE>)) {
    chomp @columns;
    my $strand = $columns[0];
    my $chromosome = $columns[1];
    #Attempt to search reference file line by line for matching criteria and copying a column of matching lines
    foreach my $reference (@transcripts) {
        my $refChr = $reference[0]; #Error for this line
        my $refStrand = $reference[1]; #Error for this line
        if ($refChr eq $chromosome && $refStrand eq $strand) {
            my $info = $reference[3]; #Error for this line
            print OUTFILE "$strand\t$chromosome\t\$info\n";
        }
    }
}
close(OUTFILE); close(INFILE);
At the moment I receive "Global symbol "@reference" requires explicit package name." What is the proper way to define this? I'm not even entirely sure my foreach loop works as desired even after the symbol is defined properly.
Fixed:
use strict;
use warnings;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";
my @transcripts;
{
    open(my $transcripts_fh, "<", $transcripts_qfn)
        or die("Can't open \"$transcripts_qfn\": $!\n");
    while (<$transcripts_fh>) {
        chomp;
        push @transcripts, [ split(/\t/, $_, -1) ];
    }
}
{
    open(my $in_fh, "<", $in_qfn)
        or die("Can't open \"$in_qfn\": $!\n");
    open(my $out_fh, ">", $out_qfn)
        or die("Can't create \"$out_qfn\": $!\n");
    while (<$in_fh>) {
        chomp;
        my ($strand, $chr) = split(/\t/, $_, -1);
        for my $transcript (@transcripts) {
            my $ref_chr = $transcript->[0];
            my $ref_strand = $transcript->[1];
            if ($chr eq $ref_chr && $strand eq $ref_strand) {
                my $info = $transcript->[2];
                say $out_fh join("\t", $strand, $chr, $info);
            }
        }
    }
}
That said, the above is very inefficient. Let's call N the number of lines in $transcripts_qfn, and let's call M the number of lines in $in_qfn. The inner loop executes N*M times. In fact, it only needs to execute N times.
use strict;
use warnings;
use feature qw( say );
my $in_qfn = $ARGV[0];
my $out_qfn = $ARGV[1];
my $transcripts_qfn = "/path/TranscriptInfo";
my %to_print;
{
    open(my $in_fh, "<", $in_qfn)
        or die("Can't open \"$in_qfn\": $!\n");
    while (<$in_fh>) {
        chomp;
        my ($strand, $chr) = split(/\t/, $_, -1);
        ++$to_print{$strand}{$chr};
    }
}
{
    open(my $transcript_fh, "<", $transcripts_qfn)
        or die("Can't open \"$transcripts_qfn\": $!\n");
    open(my $out_fh, ">", $out_qfn)
        or die("Can't create \"$out_qfn\": $!\n");
    while (<$transcript_fh>) {
        chomp;
        my ($ref_chr, $ref_strand, $info) = split(/\t/, $_, -1);
        next if !$to_print{$ref_strand};
        next if !$to_print{$ref_strand}{$ref_chr};
        say $out_fh join("\t", $ref_strand, $ref_chr, $info);
    }
}

How can I find the common items from two arrays fast?

I am trying to find the common lines among two tab-separated files based on one field.
One line of the first file:
1 52854 s64199.1 A . . . PR GT 0/0
One line of the second file:
chr1 52854 . C T 215.302 . AB=0.692308;ABP=7.18621;AC=1;AF=0.5;AN=2;AO=9;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=3.25157;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=17.5429;PAIRED=0;PAIREDR=0.25;PAO=0;PQA=0;PQR=0;PRO=0;QA=318;QR=138;RO=4;RPP=3.25157;RPPR=5.18177;RUN=1;SAF=0;SAP=22.5536;SAR=9;SRF=1;SRP=5.18177;SRR=3;TYPE=snp;technology.illumina=1;BVAR GT:DP:RO:QR:AO:QA:GL 0/1:13:4:138:9:318:-5,0,-5
Based on the second field (52854 in this example) I have a match.
Here is my code, which finds the common ones, but my files are quite large and it takes a lot of time.
Is there any way to speed up the process?
Thank you very much in advance.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
open my $map_info, $map_file or die "Could not open $map_file: $!";
my @map_array = ();
my @vcf_array = ();
while( my $mline = <$map_info>) {
    chomp $mline;
    my @data1 = split('\t', $mline);
    my $pos1 = $data1[1];
    push (@map_array, $pos1);
}
open my $vcf_info, $vcf_file or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
    if ($line !~ m/^#/) {
        push (@vcf_array, $line);
    }
}
foreach my $a (@map_array) {
    chomp $a;
    foreach my $b (@vcf_array) {
        chomp $b;
        my @data = split('\t', $b);
        my $pos2 = $data[1];
        my $ref2 = $data[3];
        my $allele = $data[4];
        my $genotype = $data[9];
        if ($a == $pos2) {
            print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
            #print "$b\n";
        }
    }
}
Here's a version that should run much faster than your own.
It reads the map file and stores each pos field in hash %wanted. Then it reads through the second file and checks whether the record is in the list of wanted values. If so, it splits the record and prints the fields you require.
Note that I haven't been able to test this beyond making sure that it compiles.
use strict;
use warnings;
use 5.010;
use autodie;
my ( $map_file, $vcf_file ) = @ARGV;
my %wanted;
{
    open my $map_fh, '<', $map_file;
    while ( <$map_fh> ) {
        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        ++$wanted{$pos};
    }
}
{
    open my $vcf_fh, '<', $vcf_file;
    while ( <$vcf_fh> ) {
        next if /^#/;
        chomp;
        my $pos = ( split /\t/, $_, 3 )[1];
        next unless $wanted{$pos};
        my ( $ref, $allele, $genotype ) = ( split /\t/ )[3, 4, 9];
        print join("\t", $pos, $ref, $allele, $genotype), "\n";
    }
}
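One detail worth noting (an aside, not from the original answer): the third argument to split limits the number of fields produced, so the long VCF lines are only split far enough to reach the pos column and the remainder is left untouched:
my $line = "chr1\t52854\t.\tC\tT\t215.302";   # shortened sample-style line
my @fields = split /\t/, $line, 3;            # at most 3 fields
print "$fields[1]\n";                         # prints: 52854
print "$fields[2]\n";                         # the unsplit remainder, still tab-joined: . C T 215.302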
Below is a minimal modification of your script that uses hash-based searches.
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
my %vcf_hash;
open( my $vcf_info, $vcf_file) or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
    next if $line =~ m/^#/; # Skip comment lines
    chomp $line;
    my (@data) = split(/\t/, $line);
    die unless @data >= 10; # Check number of fields in the input line
    my ($pos) = $data[1];
    # $. - line number in the file
    $vcf_hash{$pos}{$.} = \@data;
}
open( my $map_info, $map_file) or die "Could not open $map_file: $!";
while( my $mline = <$map_info>) {
    chomp $mline;
    my (@data) = split(/\t/, $mline);
    die unless @data >= 2; # Check number of fields in the input line
    my ($pos) = $data[1];
    if( exists $vcf_hash{$pos}) {
        my $hash_ref = $vcf_hash{$pos};
        for my $n (sort{$a<=>$b} keys %$hash_ref) {
            my $array_ref = $hash_ref->{$n};
            my $pos2 = $array_ref->[1];
            my $ref2 = $array_ref->[3];
            my $allele = $array_ref->[4];
            my $genotype = $array_ref->[9];
            print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
        }
    }
}
The script may be improved further to reduce memory use if your data files are huge.
There is no need to keep the whole map_file in memory, only its keys; make them the keys of a hash that you use for existence checks. You don't have to keep the vcf_file in memory either; you can simply decide line by line whether to print it.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
use autodie;
use constant KEY => 1;
use constant FIELDS => ( 1, 3, 4, 9 );
my ( $map_file, $vcf_file ) = @ARGV;
my %map;
{
    my $fh;
    open $fh, '<', $map_file;
    while (<$fh>) {
        $map{ ( split /\t/, $_, KEY + 2 )[KEY] } = undef;
    }
}
{
    my $fh;
    open $fh, '<', $vcf_file;
    while (<$fh>) {
        next if /^#/;
        chomp;
        my @data = split /\t/;
        print join( "\t", @data[FIELDS] ), "\n" if exists $map{ $data[KEY] };
    }
}
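For reference (an aside, not part of the original answer), FIELDS is a list constant, so @data[FIELDS] is simply an array slice at those indices:
use constant FIELDS => ( 1, 3, 4, 9 );
my @data = ( 'chr1', '52854', '.', 'C', 'T', '215.302', '.', 'INFO', 'FORMAT', '0/1:13' );
print join( "\t", @data[FIELDS] ), "\n";   # slice at indices 1, 3, 4, 9: 52854, C, T, 0/1:13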

Compare two hashes in perl and list which records are extra?

I have two text files that contain user records. I have to compare these two files, figure out which users are missing from File1, and delete these orphans from File2.
#!/usr/local/bin/perl -w
use strict;
use warnings;
use autodie;
use Text::Diff;
use List::Compare;
use Data::Dumper;
my $Users1 = "Users1.txt";
my $Users2 ="Users2.txt";
my %hash1;
my %hash2;
my %new_hash;
my @sorted_1;
my @sorted_2;
my @list_keys1;
my @list_keys2;
open(my $fh1, '<:encoding(UTF-8)', $Users1) or die "Colud not open the file!";
while(my $record1 = <$fh1>)
{
    chomp $record1;
    my @list1 = split( '/', $record1);
    foreach my $item (@list1)
    {
        $new_hash{$list1[1]} = $list1[0];
        $hash1{$list1[1]} = $list1[0];
    }
    while ( my ($key, $value) = each(%hash1) ) {
        push (@list_keys1, $key);
        @sorted_1 = sort @list_keys1;
    }
}
print "\t\tHash values for USERS1:\n";
print Dumper \%hash1;
open(my $fh2, '<:encoding(UTF-8)', $Users2) or die "Colud not open the file!";
while(my $record2 = <$fh2>)
{
    chomp $record2;
    my @list2 = split( '/', $record2);
    foreach my $item (@list2)
    {
        $hash2{$list2[1]} = $list2[0];
    }
    while ( my ($key, $value) = each(%hash2) )
    {
        push (@list_keys2, $key);
        @sorted_2 = sort @list_keys2;
    }
}
print "\n\n\t\tHash values for Users2:\n";
print Dumper \%hash2;
@hash1{@list_keys1} = 1;
@hash2{@list_keys2} = 1;
foreach(keys %hash2)
{
    print "\nThis user does not exist(to be deleted): $_\n" unless exists $hash1{$_};
}
foreach (keys %hash1)
{
    print "\nNew User (to be added):$_\n" unless exists $hash2{$_};
}
close ($fh1);
close ($fh2);
Questions:
I am not able to sort the user IDs (strings) alphabetically (here, user IDs are random strings of length 7). Are there any limitations when it comes to sorting arrays/hashes in Perl?
I am not able to compare two hashes and get the differences. What would be the most efficient way to do that?
Are there any additional libraries that I need to install in order to handle this part of code?
Sample records from file:
File1:
ASIA/ASEDF46
INDIA/PSDfT5V
CHINA/FSDfT5V
INDIA/AA44TYB
USA/BBRTT67
File 2:
INDIA/PSDfT5V
CHINA/FSDfT5V
INDIA/AA44TYB
USA/BBRTT67
UK/ZK9EELO
use strict;
use warnings;
use autodie;
open my $in, '<', 'in.txt';
open my $in2, '<', 'in_2.txt';
my (%data1, %data2);
while (<$in>) {
    chomp;
    my @split = split /\//;
    $data1{$split[0]} = $split[1];
}
while (<$in2>) {
    chomp;
    my @split = split /\//;
    $data2{$split[0]} = $split[1];
}
foreach (sort keys %data1) {
    print "User: $_ Value: $data1{$_}\n" if $data2{$_};
}
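Regarding the original questions, here is a minimal sketch (not part of the answer above, and keyed on the user ID rather than the country): plain sort works on arbitrary strings, and grep over the hash keys lists the extras in either direction:
use strict;
use warnings;

# user IDs taken from the sample files in the question
my %in_file1 = map { $_ => 1 } qw(ASEDF46 PSDfT5V FSDfT5V AA44TYB BBRTT67);
my %in_file2 = map { $_ => 1 } qw(PSDfT5V FSDfT5V AA44TYB BBRTT67 ZK9EELO);

my @orphans = sort grep { !exists $in_file1{$_} } keys %in_file2;   # in File2 but not File1
print "Orphan (delete from File2): $_\n" for @orphans;              # prints ZK9EELO
my @new_users = sort grep { !exists $in_file2{$_} } keys %in_file1; # in File1 but not File2
print "New user: $_\n" for @new_users;                              # prints ASEDF46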
