slicing out multiples of nth element - arrays

I am having a list which has more than 10k elements. I want to remove every third element.
For example,
#testlists = qw (helloworld sessions first.cgi login localpcs depthhashes.cgi search view macros plugins ...) ;
I want to remove first.cgi, depthhashses.cgi, macros and so on from original array. Grep function is little bit slower. Please suggest me a faster grep search or any other similar subroutines . Any help would be highly appreciated

I can think of a few solutions:
Grep on index divisibility
my $i = 0;
#testlist = grep { ++$i % 3 } #testlist;
repeated splicing
for (my $i = 2; $i < $#testlist; $i += 2) {
splice #testlist, $i, 1;
}
Copying with skipping
my #output;
# pre-extend the array for fewer reallocations
$#output = #testlist * 2/3;
#output = ();
# FIXME annoying off-by one errors at the end that can add one undef
for (my $i = 0; $i < #testlist; $i += 3) {
push #output, #testlist[$i, $i+1];
}
Ikegami has corrected and optimized the copying solution in his remarkable answer.
A benchmark with 1,000-element lists declares splicing the clear winner:
Rate slice grep copy splice
slice 790/s -- -10% -18% -37%
grep 883/s 12% -- -8% -29%
copy 960/s 22% 9% -- -23%
splice 1248/s 58% 41% 30% --
(slice is mob's solution)
This is probably because it offloads most of the actual work into the C-level implementations, and avoids allocations and costly Perl-level operations.
With 10,000-element lists, the advantages shift towards the other solutions. Indeed, the splice solution has a very bad algorithmic complexity, because it moves all elements after all splicing position, meaning that the last element is moved nearly 3333 times:
Rate splice slice grep copy
splice 42.7/s -- -35% -42% -49%
slice 65.3/s 53% -- -12% -23%
grep 74.2/s 74% 14% -- -12%
copy 84.4/s 98% 29% 14% --
Here is the script I used for benchmarking.

amon's copy can be tweaked to be 30% faster!
my $i = 1;
my $j = 1;
while ($i < #a) {
$a[++$j] = $a[$i+=2];
$a[++$j] = $a[++$i];
}
$#a = $j-1 if #a>2;
And you can get even faster by avoid copying anything at all (though the result is a reference to an array). The longer the strings in each element, the better this will do.
my $i = 0;
my $ref = sub { \#_ }->( grep { ++$i % 3 } #a );
Results (10,000 elements):
>perl a.pl
Rate splice_a splice_r grep_a copy1_a copy1_r copy2_r copy2_a grep_r
splice_a 52.8/s -- -0% -51% -54% -56% -66% -66% -68%
splice_r 52.9/s 0% -- -51% -54% -55% -66% -66% -68%
grep_a 107/s 103% 103% -- -7% -10% -30% -31% -34%
copy1_a 115/s 118% 117% 7% -- -3% -25% -26% -30%
copy1_r 119/s 125% 124% 11% 3% -- -23% -23% -27%
copy2_r 154/s 191% 190% 43% 34% 29% -- -0% -6%
copy2_a 154/s 192% 192% 44% 34% 30% 0% -- -6%
grep_r 163/s 209% 209% 52% 42% 37% 6% 6% --
Benchmark:
use strict;
use warnings;
use Benchmark qw( cmpthese );
my #testlist = qw( helloworld sessions first.cgi login localpcs depthhashes.cgi search view macros );
#testlist = ( #testlist ) x ( 10000 / #testlist );
sub grep_a { my #a = #testlist; my $i = 0; #a = grep { ++$i % 3 } #a; 1 }
sub copy1_a { my #a = #testlist;
my #b;
$#b = $#a; #b = (); # Does absolutely nothing in this benchmark because of optimisations in Perl.
for (my $i = 0; $i < #a; $i += 3) {
push #b, #a[$i, $i+1];
}
1
}
sub copy2_a { my #a = #testlist;
my $i = 1;
my $j = 1;
while ($i < #a) {
$a[++$j] = $a[$i+=2];
$a[++$j] = $a[++$i];
}
$#a = $j-1 if #a>2;
1
}
sub splice_a { my #a = #testlist;
for (my $i = 2; $i < $#a; $i += 2) {
splice #a, $i, 1;
}
1
}
sub grep_r { my $r = [ #testlist ]; my $i = 0; $r = sub { \#_ }->( grep { ++$i % 3 } #$r ); 1 }
sub copy1_r { my $r = [ #testlist ];
my #b;
$#b = $#$r; #b = (); # Does absolutely nothing in this benchmark because of optimisations in Perl.
for (my $i = 0; $i < #$r; $i += 3) {
push #b, #$r[$i, $i+1];
}
$r = \#b;
1
}
sub copy2_r { my $r = [ #testlist ];
my $i = 1;
my $j = 1;
while ($i < #$r) {
$r->[++$j] = $r->[$i+=2];
$r->[++$j] = $r->[++$i];
}
$#$r = $j-1 if #$r>2;
1
}
sub splice_r { my $r = [ #testlist ];
for (my $i = 2; $i < $#$r; $i += 2) {
splice #$r, $i, 1;
}
1
}
cmpthese(-3, {
grep_a => \&grep_a,
copy1_a => \&copy1_a,
copy2_a => \&copy2_a,
splice_a => \&splice_a,
grep_r => \&grep_r,
copy1_r => \&copy1_r,
copy2_r => \&copy2_r,
splice_r => \&splice_r,
});

Use an array slice.
#testlists = #testlists[ grep { ($_+1) % 3 } 0..$#testlists ];

I'm not sure what you mean about using grep but perhaps you mean something like this.
for $i (0 .. $#testlists) {
if (($i % 3) == 2) {
delete $testlists[$i];
}
}
# Demonstrate the values.
foreach $e (#testlists) {
print "$e\n";
}

Related

Is there a built in Perl Function for finding duplicate subarrays(exact order) in an array?

Lets say the array is (1,2,3,4,5,6,7,8,9),
Another subarray is (2,3,4)
Is there a function to check if the subarray pattern(full exact order) exists within array?
In this case, it would return any indicator(index) that shows it exists.
Also would need to work for duplicates if there are multiple subarrays existing in the array like (4,2,3,4,2,3,4).
If it happens to match multiple times for example:
Array = (2,3,2,3,2,2,3,2)
Sub Array = (2,3,2)
Would just return starting index of matches in order: 0,2,5
Or if it removes, would result in (3,2)
Edit: Elements don't have to be num
There's no built-in method, but it's easy to write:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw/say/;
# Takes two arrayrefs of numbers.
#
# Returns the first index in the first one where the second list appears, or
# -1 if not found.
sub find_sublist(++) {
my ($haystack, $needle) = #_;
my $nlen = #$needle;
my $hlen = #$haystack;
return -1 if $hlen == 0 || $nlen == 0;
HAYSTACK_POS:
for (my $n = 0; $n <= $hlen - $nlen; $n++) {
for (my $m = 0; $m < $nlen; $m++) {
if ($haystack->[$n + $m] != $needle->[$m]) {
next HAYSTACK_POS;
}
}
return $n;
}
return -1;
}
# Takes two arrayrefs of numbers.
#
# Returns a list of the starting indexes of the first list
# of every run of the second list. Returns an empty list if
# there are no matches.
sub find_sublists(++) {
my ($haystack, $needle) = #_;
my $nlen = #$needle;
my $hlen = #$haystack;
my #positions;
return #positions if $hlen == 0 || $nlen == 0;
HAYSTACK_POS:
for (my $n = 0; $n <= $hlen - $nlen; $n++) {
for (my $m = 0; $m < $nlen; $m++) {
if ($haystack->[$n + $m] != $needle->[$m]) {
next HAYSTACK_POS;
}
}
push #positions, $n;
}
return #positions;
}
# Takes two arrayrefs of numbers.
#
# Returns a new list that is the first one with every non-overlapping run of
# the second second list removed.
sub remove_sublists(++) {
my #haystack = #{$_[0]};
my $needle = $_[1];
while ((my $pos = find_sublist #haystack, $needle) != -1) {
splice #haystack, $pos, #$needle;
}
return #haystack;
}
my #list1 = (1,2,3,4,5,6,7,8,9);
my #list2 = (4,2,3,4,2,3,4);
my #list3 = (2,3,2,3,2,2,3,2);
say find_sublist(#list1, [2, 3, 4]); # Returns 1
say find_sublist([2,9,3,4], [2,3,4]); # Returns -1
my #positions = find_sublists(#list2, [2,3,4]); # 1,4
say join(",", #positions);
#positions = find_sublists(#list3, [2,3,2]); # 0,2,5
say join(",", #positions);
say join(",", remove_sublists(#list1, [2,3,4])); # 1,5,6,7,8,9
say join(",", remove_sublists(#list3, [2,3,2])); # 3,2
If the inputs are numbers representable by your perl's integers (as shown), you can use
# Indexes
my $pattern = pack "W*", #pattern;
my $array = pack "W*", #array;
my #indexes;
push #indexes, $-[0] while $array =~ /\Q$pattern/g;
# Removal
my $pattern = pack "W*", #pattern;
my $array = pack "W*", #array;
$array =~ s/\Q$pattern//g;
#array = unpack "W*", $array;
How it handles overlaps:
/---\ /---\ Removed
2,3,2 from 2,3,2,3,2,2,3,2
\---/ Not removed
Note that this also works if you can map the inputs to numbers.
my ( %map_f, #map_r );
for ( #array, #pattern ) {
if ( !exists{ $map{ $_ } } ) {
$map_f{ $_ } = #map_r;
push #map_r, $_;
}
}
my $pattern = pack "W*", #map_f{ #pattern };
my $array = pack "W*", #map_f{ #array };
$array =~ s/\Q$pattern//g;
#array = #map_r[ unpack "W*", $array ];
It's not the best algorithm, but it should be very fast by moving the work from Perl to the regex engine.

Is there any function in Perl that shifts the array element without removing them?

I have an array, say #array1 = qw(abc def ghi jkl).
Now, I want to use this array in a way that elements are shifted 1 by 1, but that shifting takes place virtually, and not in the array.
Like, "shift" will shift the elements and remove them from the array. But, I don't want those elements to be removed.
Short Code Snippet:
while (my $rName = shift #array1) {
my $bName = shift #array1 ;
## Do something now with the value
}
##And now, I want that I can use #array1 again with the original elements residing
How can it be implemented?
In Perl 5.36 you'll be able to do this:
for my ($rName, $bName) (#array1) { ... }
Use a C-style for loop and increment by two. $#foo is the index of the last element.
my #foo = 0 .. 5;
for (my $i = 0; $i <= $#foo; $i += 2){
my $r_name = $foo[$i];
my $b_name = $foo[$i+1];
}
If you wanted fancier-looking code, you could use natatime from List::MoreUtils on CPAN, which gives you an iterator that you can use in a while loop.
use List::MoreUtils 'natatime';
my #foo = 0 .. 5;
my $it = natatime 2, #foo;
while ( my ($r_name, $b_name) = $it->() ) {
print "$r_name $b_name\n";
}
You can also use pairs from the core List::Util module:
A convenient shortcut to operating on even-sized lists of pairs, this function returns a list of ARRAY references, each containing two items from the given list.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use List::Util qw/pairs/;
my #array1 = qw/a 1 b 2 c 3/;
for my $pair (pairs #array1) {
my ($rName, $bName) = #$pair;
say "$rName => $bName";
}
say "#array1";
# Require Perl 5.36
for my ( $rName, $bName ) ( #array1 ) {
...
}
my #array2 = #array1;
while ( #array2 ) {
my $rName = shift #array2;
my $bName = shift #array2;
...
}
for my $i ( 0 .. $#array1/2 ) {
my $rName = $array1[ $i * 2 + 0 ];
my $bName = $array1[ $i * 2 + 1 ];
...
}
for ( my $i = 0; $i < #array1; ) {
my $rName = $array1[ $i++ ];
my $bName = $array1[ $i++ ];
...
}
use List::Util qw( pairs );
for ( pairs #array1 ) {
my ( $rName, $bName ) = #$_;
...
}
I tried to organize them from fastest to slowest, but I didn't actually benchmark anything.
Three of the solutions were previously mentioned, but noone mentioned what I think is the fastest solution that doesn't require 5.36 (which hasn't been released yet).

How to split an array in uneven batches in Perl

I am trying to split an array into batches of 500, but I want to cover the case where the array size is not evenly divided by 500 as well. For example:
Array = [1,...,1100]
subArray1 = [1..500]
subArray2 = [500..1000]
subArray3 = [1000..1100]
Here is what I have so far:
my #array = [prepopulated];
my $array_size = scalar(#array);
my $start = 0;
my $end = 500;
while ($start < $array_size) {
my #subArray = #array[$start..$end];
print Dumper(\#subArray);
$start = $end;
if($end + 500 > $array_size) {
$end = $start + ($array_size % 500);
} else {
$end += 500;
}
}
This seems to work, but is there a better way to do this or does this break for some case I am not seeing?
List::MoreUtils covers exactly this kind of task:
use strict; use warnings;
use List::MoreUtils qw(natatime);
my #Array = (1..1100);
my $it = natatime 500, #Array;
while (my #vals = $it->()) {
printf "%d\n", 0+#vals;
}
500
500
100
No need to reinvent the wheel.
I think splice would be simpler and cleaner:
use strict;
use warnings;
my #array = (1 .. 1100);
my $batch_size = 500;
while (#array) {
my #batch = splice(#array, 0, $batch_size);
# ...
}
Keep in mind that perl's range operator start..end is inclusive both sides. 0..500 gives you: 0, 1, ..., 499, 500
Assuming you don't want to duplicate boundary elements (e.g. index 500 included in two sub arrays), try the following.
Summary:
change $end initialization to 499
inside the loop set $start = $end+1
when checking if the new end is overflowing use >= instead: $end + 500 >= $array_size
code
my #array = [prepopulated];
my $array_size = scalar(#array);
my $start = 0;
my $end = 499;
while ($start < $array_size) {
my #subArray = #array[$start..$end];
print Dumper(\#subArray);
$start = $end+1;
if($end + 500 >= $array_size) {
$end = $array_size-1;
} else {
$end += 500;
}
}
Creating a second array of indexes for natatime is extremely wasteful.
use List::Util qw( min );
for (my $start=0; $start<#array; $start+=500) {
my $end = min($start+500, 0+#array);
...
}

Perl - Split sorted array into two arrays based on values

I am fairly new to Perl.
I have a large array of numbers which is sorted first:
my #numbers = do {
open my $fh, '<', glob("hotfolder/*.out") or die $!;
local $/;
<$fh> =~ /\d+/g;
};
my #sorted_numbers = sort { $a <=> $b } #numbers;
The sorted values are now in the #sorted_numbers array.
Next it needs to find the missing values in this array:
my #missing;
my $i = 0;
for ( $sorted_numbers[ 0 ] .. $sorted_numbers[ -1 ] ) {
++$i, next if $_ == $sorted_numbers[ $i ];
push #missing, "$_\n";
}
This is where I need some help. I am not sure how to achieve the next step.
Each number in the array either starts with 2 (200000) or 3 (300000).
The missing values should only show between these ranges.
Let's say the first range finished at 240000. The other range will start somewhere over 300000.
Of course I do not want to return all the values outside these two ranges as missing.
I tried something along these lines but the syntax and logic is completely wrong:
foreach (my $step = #sorted_numbers) {
if ($step <= 299999) {
my $i = 0;
for ( $sorted_numbers[ 0 ] .. $sorted_numbers[ -1 ] ) {
++$i, next if $_ == $sorted_numbers[ $i ];
push #missing, "$_\n";
}
}
else ($step > 299999) {
my $i = 0;
for ( $sorted_numbers[ 0 ] .. $sorted_numbers[ -1 ] ) {
++$i, next if $_ == $sorted_numbers[ $i ];
push #missing2, "$_\n";
}
}
}
Let me know if something is unclear and I'll be happy to provide further information.
Just check the difference between neighbouring numbers. If it's more than 1 but small enough (see $gap_size below), report the missing numbers:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $gap_size = 50; # Tune this to your needs.
my #numbers = sort { $a <=> $b } map /\d+/g, <>;
for my $i (1 .. $#numbers) {
my $diff = $numbers[$i] - $numbers[$i-1];
if ($diff > 1 && $diff < $gap_size) {
say for $numbers[$i-1] + 1 .. $numbers[$i] - 1;
}
}

algorithm binary search return range perl

I have got one array with milions of floating points number (ordered) and then another one smaller and i need to match the numbers within a certain tolerance (there is no overlap between the values in the big array and the values in the small array) from the small array in the big array. okay so no big deal, this is the perl function for returning a not perfect match within the tolerance, it's inside a for loop there I am looping through the small array values.
sub bin_search{
my ($arr, $v, $t ) = #_;
my ($min, $max) = (0, #$arr-1);
while ($min <= $max) {
my $w = $v + $t;
my $k = $v - $t;
my $try = int( ( $min + $max ) / 2 );
$min = $try + 1, next if $arr -> [$try] < $k ;
$max = $try - 1, next if $arr -> [$try] > $w ;
return $arr -> [$try] ;
}
return 0;
}
but then after checkin my data it seems I have got some values discarded because it is returning just the first match.
I tried grep but it is too slow.
my $min = $val - $t;
my $max = $val + $t;
my #arr2 = grep { ( $_ > $min ) && ($_ < $max ) }#big_arr1;
so I wanted to modify a bit the binary search for returning a range from $min to $max, because I thought once there is one match is either at $min or $max, so something like
sub bin_search{
my ($arr, $v, $t ) = #_;
my ($min, $max) = (0, #$arr-1);
my $w = $v + $t;
my $k = $v - $t;
while ($min <= $max) {
my $try = int( ( $min + $max ) / 2 );
$min = $try + 1, next if $arr -> [$try] < $k ;
$max = $try - 1, next if $arr -> [$try] > $w ;
last;
}
my #fin;
if ( ($arr -> [$try] < $w) && ($arr -> [$try] > $k) ) {
push #fin, $arr ->[$try]; $try++ }
return \#fin;
}
but I am missing some values, and I think that I am missing something, should I look just at one direction at the time? like left until we reach the lower limit then return to $try and do the same thing until higher limit?
Start by finding the index of a matching element using a binary search.
Once you have that, you need to find where the range starts. You can use a binary search for that as well, but a linear search is also acceptable if the number of matching elements is usually small.
Finally, you need to find the end of the range. You can use the same approach as you used for finding the start of the range.
The problem with your solution is that you didn't look for the start of the range.
The following is an untested implementation that uses the linear scan approach (like yours), so it assumes that there will be very few matching elements:
sub binsearch_numeric_range {
my $min = shift;
my $max = shift;
my $array = shift;
return () if !#$array;
my $i = 0;
my $j = $#$array;
my $k;
while (1) {
$k = int(($i+$j)/2);
if ($array->[$k] > $max) {
$j = $k-1;
return () if $i > $j;
}
elsif ($array->[$k] < $min) {
$i = $k+1;
return () if $i > $j;
}
else {
last;
}
}
my $min_k = $k; --$min_k while $min_k > 0 && $array->[$min_k - 1] >= $min;
my $max_k = $k; ++$max_k while $max_k < $#$array && $array->[$max_k + 1] <= $max;
return #$array[$min_k .. $max_k];
}
my #matches = binsearch_numeric_range($v-$t, $v+$t, $arr);
An implementation that doesn't require writing a whole new binsearch:
my $idx = binsearch { abs($a-$b) <= $t ? 0 : $a <=> $b } $v, #$arr;
my #range;
if ($idx >= 0) {
my $min_idx = $idx; --$min_idx while $min_idx > 0 && $arr->[$min_idx-1] >= ($v-$t);
my $max_idx = $idx; ++$max_idx while $max_idx < $#$arr && $arr->[$max_idx+1] <= ($v+$t);
#range = #$array[$min_idx .. $max_idx];
}
The binsearch used is defined here.

Resources