Removing rows in a dataset matching a value from a separate dataset - loops

I am having some complications with matching strings to each other.
Say I have the following table:
broken
vector
unidentified
synthetic
artificial
And I have a second dataset that looks like this:
org1 Fish
org2 Amphibian
org3 vector
org4 synthetic species
org5 Mammal
I want to remove all the rows from the second table that match the string from the first table so that the output looks like this:
org1 Fish
org2 Amphibian
org5 Mammal
I was thinking of using grep -v in bash, but I am not quite sure how to make it loop through all the strings in table 1.
I am trying to work it out in Perl, but for some reason it returns all my values instead of just the ones that match. Any idea why?
My script looks like this:
#!/bin/perl -w
($br_str, $dataset) = #ARGV;
open($fh, "<", $br_str) || die "Could not open file $br_str/n $!";
while (<$fh>) {
$str = $_;
push #strings, $str;
next;
}
open($fh2, "<", $dataset) || die "Could not open file $dataset $!/n";
while (<$fh2>) {
chomp;
#tmp = split /\t/, $_;
$groups = $tmp[1];
foreach $str(#strings){
if ($str ne $groups){
#working_lines = #tmp;
next;
}
}
print "#working_lines\n";
}

chomp your input and use a hash for your first table:
use warnings;
use strict;
my ( $br_str, $dataset ) = #ARGV;
open(my $fh, "<", $br_str ) || die "Could not open file $br_str/n $!";
my %strings;
while (<$fh>) {
chomp;
$strings{$_}++;
}
open(my $fh2, "<", $dataset ) || die "Could not open file $dataset $!/n";
while (<$fh2>) {
chomp;
my #tmp = split /\s+/, $_;
my $groups = $tmp[1];
print "$_\n" unless exists $strings{$groups};
}
Note that I used \s+ instead of \t, just to make my copy/paste easier.

Related

How to count the number of keys that exist in a hash?

I am working with an input file that contains tab delimitated sequences. Groups of sequences are separated by line breaks. The file looks like:
TAGC TAGC TAGC HELP
TAGC TAGC TAGC
TAGC HELP
TAGC
Here is the code I have:
use strict;
use warnings;
open(INFILE, "<", "/path/to/infile.txt") or die $!;
my %hash = (
TAGC => 'THIS_EXISTS',
GCTA => 'THIS_DOESNT_EXIST',
);
while (my $line = <INFILE>){
chomp $line;
my $hash;
my #elements = split "\t", $line;
open my $out, '>', "/path/to/outfile.txt" or die $!;
foreach my $sequence(#elements){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
else
}
$count++;
print "Doesn't exist ", $count, "\n";
}
}
}
How can I tell how many sequences exist before I print? I need to put that information into the name of the output file.
Ideally, I would have a variable that I could include in the name of the file. Unfortunately, I can't just take the scalar of #elements because there are some sequences that won't get printed out. When I try to push the keys that exist into an array and then print the scalar of that array, I still don't get the results I need. Here is what I tried (all variables that need to be global are):
open my $out, '>', "/path/to/file.$number.txt" or die $!;
foreach my $sequence(#elements){
if (exists $hash{$sequence}){
push(#Array, $hash{$sequence}, "\n");
my $number = #Array;
print $out ">$sequence\n$hash{$sequence}\n";
#....
Thanks for the help. Really appreciate it.
my $sequences = grep exists $hash{$_}, #elements;
open my $out, '>', "/path/to/outfile_containing_$sequences.txt" or die $!;
In list context, grep filters a list by a criterion; in scalar context, it returns a count of elements that met the criterion.
The easiest way would be to keep track of how many keys you are printing in a variable and once your loop finish, just rename the file with the number you calculated. Perl comes with a built-in function to do this. The code would be something like this:
use strict;
use warnings;
open(INFILE, "<", "/path/to/infile.txt") or die $!;
my %hash = (
TAGC => 'THIS_EXISTS',
GCTA => 'THIS_DOESNT_EXIST',
);
my $ammt;
while (my $line = <INFILE>){
chomp $line;
my $hash;
my #elements = split "\t", $line;
open my $out, '>', "/path/to/outfile.txt" or die $!;
foreach my $sequence(#elements){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
$ammt++;
}
else
}
print "Doesn't exist ", $count, "\n";
}
}
}
rename "/path/to/outfile.txt", "/path/to/outfile${ammt}.txt" or die $!;
I removed the $count variable, since it's not declared in your code (strict would complain about that). Here's the official doc for rename. Since it returns True or False, you can check that it was successful or not.
By the way, be aware that:
push(#Array, $hash{$sequence}, "\n");
is storing two items ($hash{$sequence} and \n), so that count would be twice as it should be.

How can I find the common items from two arrays fast?

I am trying to findy the common lines among two tab separated files based one field.
One line of the first file:
1 52854 s64199.1 A . . . PR GT 0/0
One line of the second file:
chr1 52854 . C T 215.302 . AB=0.692308;ABP=7.18621;AC=1;AF=0.5;AN=2;AO=9;CIGAR=1X;DP=13;DPB=13;DPRA=0;EPP=3.25157;EPPR=3.0103;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=17.5429;PAIRED=0;PAIREDR=0.25;PAO=0;PQA=0;PQR=0;PRO=0;QA=318;QR=138;RO=4;RPP=3.25157;RPPR=5.18177;RUN=1;SAF=0;SAP=22.5536;SAR=9;SRF=1;SRP=5.18177;SRR=3;TYPE=snp;technology.illumina=1;BVAR GT:DP:RO:QR:AO:QA:GL 0/1:13:4:138:9:318:-5,0,-5
Based on the second field (52854) in this example I have a much.
Here is my code which finds the common ones, but my files are quite large and takes a lot of time.
Is there any way to speed up the process?
Thank you very much in advance.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
open my $map_info, $map_file or die "Could not open $map_file: $!";
my #map_array = ();
my #vcf_array = ();
while( my $mline = <$map_info>) {
chomp $mline;
my #data1 = split('\t', $mline);
my $pos1 = $data1[1];
push (#map_array, $pos1);
}
open my $vcf_info, $vcf_file or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
if ($line !~ m/^#/) {
push (#vcf_array, $line);
}
}
foreach my $a (#map_array) {
chomp $a;
foreach my $b (#vcf_array) {
chomp $b;
my #data = split('\t', $b);
my $pos2 = $data[1];
my $ref2 = $data[3];
my $allele = $data[4];
my $genotype = $data[9];
if ($a == $pos2) {
print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
#print "$b\n";
}
}
}
Here's a version that should run much faster than your own
It reads the map file and stores each pos field in hash %wanted. Then it reads through the second file and checks whether the record is in the list of wanted values. If so then it splits the record and prints the fields you require
Note that I haven't been able to test this beyond making sure that it compiles
use strict;
use warnings;
use 5.010;
use autodie;
my ( $map_file, $vcf_file ) = #ARGV;
my %wanted;
{
open my $map_fh, '<', $map_file;
while ( <$map_fh> ) {
chomp;
my $pos = ( split /\t/, $_, 3 )[1];
++$wanted{$pos};
}
}
{
open my $vcf_fh, '<', $vcf_file;
while ( <$vcf_fh> ) {
next if /^#/;
chomp;
my $pos = ( split /\t/, $_, 3 )[1];
next unless $wanted{$pos};
my ( $ref, $allele, $genotype ) = ( split /\t/ )[3, 4, 9];
print join("\t", $pos, $ref, $allele, $genotype), "\n";
}
}
Below please find minimal modification of your script for hash based searches
use strict;
use warnings;
my $map_file = $ARGV[0];
my $vcf_file = $ARGV[1];
my %vcf_hash;
open( my $vcf_info, $vcf_file) or die "Could not open $vcf_file: $!";
while( my $line = <$vcf_info>) {
next if $line =~ m/^#/; # Skip comment lines
chomp $line;
my (#data) = split(/\t/, $line);
die unless #data >= 10; # Check number of fields in the input line
my ($pos) = $data[1];
# $. - line number in the file
$vcf_hash{$pos}{$.} = \#data;
}
open( my $map_info, $map_file) or die "Could not open $map_file: $!";
while( my $mline = <$map_info>) {
chomp $mline;
my (#data) = split(/\t/, $mline);
die unless #data >= 2; # Check number of fields in the input line
my ($pos) = $data[1];
if( exists $vcf_hash{$pos}) {
my $hash_ref = $vcf_hash{$pos};
for my $n (sort{$a<=>$b} keys %$hash_ref) {
my $array_ref = $hash_ref->{$n};
my $pos2 = $array_ref->[1];
my $ref2 = $array_ref->[3];
my $allele = $array_ref->[4];
my $genotype = $array_ref->[9];
print $pos2 . "\t" . $ref2. "\t".$allele."\t".$genotype. "\n";
}
}
}
The script may be improved further to reduce memory use if you use huge data files.
There is not a need to keep your map_file in memory but just keys. It is good to make them keys in a hash which you use for existence checking. You don't have to keep your vcf_file in memory as well, but you can just make a decision to output or not.
#!/app/languages/perl/5.14.2/bin/perl
use strict;
use warnings;
use autodie;
use constant KEY => 1;
use constant FIELDS => ( 1, 3, 4, 9 );
my ( $map_file, $vcf_file ) = #ARGV;
my %map;
{
my $fh;
open $fh, '<', $map_file;
while (<$fh>) {
$map{ ( split /\t/, $_, KEY + 2 )[KEY] } = undef;
}
}
{
my $fh;
open $fh, '<', $vcf_file;
while (<$fh>) {
next if /^#/;
chomp;
my #data = split /\t/;
print join "\t", #data[FIELDS] if exists $map{ $data[KEY] };
}
}

Perl Hashes of Arrays and Some issues

I currently have a csv file that looks like this:
a,b
a,d
a,f
c,h
c,d
So I saved these into a hash such that the key "a" is an array with "b,d,f" and the key "c" is an array with "h,d"... this is what I used for that:
while(<$fh>)
{
chomp;
my #row = split /,/;
my $cat = shift #row;
$category = $cat if (!($cat eq $category)) ;
push #{$hash{$category}}, #row;
}
close($fh);
Not sure about the efficiency but it seems to work when I do a Data Dump...
Now, the issue I'm having is this; I want to create a new file for each key, and in each of those files I want to print every element in the key, as such:
file "a" would look like this:
b
d
f
<end of file>
Any ideas? Everything I've tried isn't working, I'm not too familiar / experienced with hashes...
Thanks in advance :)
The output process is very simple using the each iterator, which provides the key and value pair for the next hash element in a single call
use strict;
use warnings;
use autodie;
open my $fh, '<', 'myfile.csv';
my %data;
while (<$fh>) {
chomp;
my ($cat, $val) = split /,/;
push #{ $data{$cat} }, $val;
}
while (my ($cat, $values) = each %data) {
open my $out_fh, '>', $cat;
print $out_fh "$_\n" for #$values;
}
#!/usr/bin/perl
use strict;
use warnings;
my %foos_by_cat;
{
open(my $fh_in, '<', ...) or die $!;
while (<$fh_in>) {
chomp;
my ($cat, $foo) = split /,/;
push #{ $foos_by_cat{$cat} }, $foo;
}
}
for my $cat (keys %foos_by_cat) {
open(my $fh_out, '>', $cat) or die $!;
for my $foo (#{ $foos_by_cat{$cat} }) {
print($fh_out "$foo\n");
}
}
I wrote the inner loop as I did to show the symmetry between reading and writing, but it can also be written as follows:
print($fh_out "$_\n") for #{ $foos_by_cat{$cat} };

match a list of identifiers to another list using perl

I am stuck on how to loop a list over another list. Perhaps I am not searching for the right words but I am stuck and would appreciate any help on my code.
I went over this thread but am still getting errors running my script. How do I search a Perl array for a matching string?
the database file
chr1 1692239 1692249 AH_GARP2_comp198_c0_seq1
chr1 2233934 2233944 CS_GARP2_comp323_c0_seq1
chr1 5993434 5993444 CS_GARP2_comp635_c0_seq1
chr1 6198157 6198167 CS_GARP2_comp115_c0_seq1
chr1 6465781 6465791 JB_GARP2_comp560_c0_seq1
chr1 7827923 7827933 JB_GARP2_comp855_c0_seq1
chr1 7920939 7920949 AA_GARP2_comp614_c0_seq1
chr1 7964000 7964010 CS_GARP2_comp717_c0_seq1
chr1 9314857 9314867 AH_GARP2_comp237_c0_seq1
chr1 9654532 9654542 AH_GARP2_comp632_c0_seq1
the query file
name1 CS_GARP2_comp635_c0_seq1
name2 JB_GARP2_comp855_c0_seq1
name3 AH_GARP2_comp198_c0_seq1
name4 AH_GARP2_comp237_c0_seq1
My code
#!/usr/bin/perl5.16.2
use 5.16.2;
use lib '/users/ec1/perl5/lib/perl5/';
use warnings;
use strict;
my $filename = shift; ## database
my $filename2= shift; ## list of ids
open (DB, '<', $filename ) || die "Unable to open: $!";
open (I , '<', $filename2) || die "Unable to open: $!";
my #DB;
while (<DB>) {
chomp;
my #DB = split /\t/, $_; ## define as list
#print "#DB[0,1,2,3]\n";
}
while (my $line = <I>) {
chomp $line;
my ($name, $id) = split /\t/, $line;
if ($DB[3] =~ /$id/) {print "$DB[0]\t$DB[1]\t$DB[2]\t$DB[3]\n";
} else {print "na\n"}
}
Put DB into hash (%DB) to ease searches.
use strict; use warnings;
my $filename = shift; ## database
my $filename2= shift; ## list of ids
my %DB;
open (DB, '<', $filename ) || die "Unable to open: $!";
while (<DB>) {
chomp;
my #row = split( /[ \t]+/, $_); ## define as list
die "expected four items in db file - line $.\n" unless #row == 4; # expect four elements in a row
die "duplicate id in db file - line $." if exists $DB{$row[3]};
$DB{$row[3]} = \#row;
}
close DB;
open (I , '<', $filename2) || die "Unable to open: $!";
while (<I>) {
chomp;
my ($name, $id) = split /[ \t]+/, $_;
if(exists ($DB{$id})) {
my #row = #{$DB{$id}};
print join("\t", #row), "\n";
} else {
print "na\n"
}
}
close(I);
P.S. I have changed split pattern to ease tests of copy&paste data file content
The problem with your approach is that the my #DB inside while loop creates a lexical scope so it will always contain the last line's contents and that content will not be available outside the loop.
You should read the id file in to a hash map and check if the line for db file exists as a key in hash.
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
my ($dbfile, $idfile) = #ARGV;
open my $id_fh, '<', $idfile;
open my $db_fh, '<', $dbfile;
my %ids;
while (<$id_fh>) {
++$ids{$_} for (split /\s+/)[1]; # split and put column2 as key in %ids.
}
while (<$db_fh>) {
my $fld = (split /\s+/)[3]; # split and assign column4 to $fld
print "na\n" and next unless $ids{$fld}; # print "na" if fld is not in hash and move to next line
print "$_"; # print the line if column4 exits.
}

problem with the code in perl

My problem is that I am not able to figure out that why my code is taking each of the line from the file as one element of an array instead of taking the whole record starting from AD to SS as one element of the array. As you can see that my file is starting from AD and ending at SS which is same for all the followed lines in the data. But I want to make the array having elements starting from AD to SS which will be having all the lines in between AD to SS that is BC....,EG...., FA.....etc.not each line as an element. I tried my way and get the same file as such.Could anyone check my code. Thanks in advance.
AD uuu23
BC jjj
EG iii
FA vvv
SS
AD hhh25
BC kkk
EG ppp
FA aaa
SS
AD ttt26
BC xxx
FA rrr
SS
#!/usr/bin/env perl
use strict;
use warnings;
my $ifh;
my $line = '';
my #data;
my $ifn = "fac.txt";
open ($ifh, "<$ifn") || die "can't open $ifn";
my $a = "AD ";
my $b = "SS ";
my $_ = " ";
while ($line = <$ifh>)
{
chomp
if ($line =~ m/$a/g); {
$line = $_;
push #data, $line;
while ($line = <$ifh>)
{
$line .= $_;
push #data, $line;
last if
($line =~ m/$b/g);
}
}
push #data, $line; }
print #data;
If I understand correctly your problem, the fact is that the way you are reading the file:
while ($line = <$ifh>)
is inherently a line-by-line approach. It uses the content of the "line termination variable" ($/) to understand where to split lines. One easy way to change this behavior is un-defining the $/:
my $oldTerminator = $/;
undef $/;
....... <your processing here>
$/ = $oldTerminator;
so, your file would be just one line, but I am not sure what would happen of your code.
Another approach is the following (keeping in mind what I said about the fact that you are reading the file line-by-line): instead of doing
`push #data, $line;`
at each iteration of your loop, just accumulate the lines you read in a variable
$line .= $_;
(like you already do), and do the push only at the end, just once. Actually, this second approach will be more easily applicable to your code (you only have to remove the two push statements you have and put one outside of the loop).
I believe part of your problem is here
chomp
if ($line =~ m/$a/g);
it should be
chomp;
if ($line =~ m/$a/g)
otherwise the if statement is always executed. Please update your question if this has helped you advance
Here's a way to accomplish reading the records into an array, with newlines removed:
Code:
use strict;
use warnings;
use autodie;
my #data;
my $record;
my $file = "fac.txt";
open my $fh, '<', $file;
while (<$fh>) {
chomp;
if (/^AD /) { # new record starts
$record = $_;
while (<$fh>) {
chomp;
$record .= $_;
last if /^SS\s*/;
}
push #data, $record;
} else { die "Data outside record: $_" }
}
use Data::Dumper;
print Dumper \#data;
Output:
$VAR1 = [
'AD uuu23BC jjjEG iiiFA vvvSS',
'AD hhh25BC kkkEG pppFA aaaSS',
'AD ttt26BC xxxFA rrrSS'
];
This is another version, using the input record separator $/:
use strict;
use warnings;
use autodie;
my $file = "fac.txt";
open my $fh, '<', $file;
my #data;
$/ = "\nSS";
while (<$fh>) {
s/\n//g;
push #data, $_;
}
use Data::Dumper;
print Dumper \#data;
Produces the same output with this data. It does not care about the record start characters, only the end, which is SS at the beginning of a line.

Resources