efficiently replacing many subarrays of 2d array in numpy on large datasets - arrays

Is there a way to speed it up? takes too long on realy large datasets...
"matrice" is a list of numpy arrays with no solid lenght (some could be 1-5 elements longer or shorter)
def replaceScaleBelowZero(arr):
if np.amax(arr)<=0:
arr[arr<=0] = minmax_scale(arr[arr<=0],(min_thresh*0.75,min_thresh))
elif np.min(arr)<=0:
arr[arr<=0] = minmax_scale(arr[arr<=0],(min_thresh*0.75,min_thresh))
return arr
def replaceScaleBelowMinThresh(arr):
if np.amax(arr)<min_thresh:
arr[arr<sl_thresh] = minmax_scale(arr[arr<min_thresh],(min_thresh*0.75,min_thresh))
elif np.min(arr)<min_thresh:
arr[arr<min_thresh] = minmax_scale(arr[arr<min_thresh],(min_thresh*0.75,min_thresh))
return arr
matrice = [replaceScaleBelowZero(slice_ ) for slice_ in matrice ]
matrice = [replaceScaleBelowMinThresh(slice_ ) for slice_ in matrice ]

sklearn.preprocessing.minmax_scale use a lot of checkings. If you rewrite your
arr[arr<=0] = minmax_scale(arr[arr<=0],(min_thresh*0.75,min_thresh))
as
a = arr[arra<=0]
a -= a.min()
a /= a.max()
a *= (0.25 * min_thresh)
a += 0.75 * min_thresh
(assuming arr is 1d) it should be faster. If it works I think it can be optimized further with rewriting this -=, /=, *=, += with just two such opertaions.
In your second function you use
arr[arr<sl_thresh] = minmax_scale(arr[arr<min_thresh] ...
if sl_thresh != min_thresh this can give errors. If sl_thresh = min_thresh I guess you can drop you if-else clause as your ValueError likely was raised by sklearn.

Related

Updated Python Gurus: Find the difference between these two running sum functions(do they do the same thing?) Hackerrank: Array Manipulation

On Hackerrank, I was solving the Array Manipulation question and when I submitted my code with runningSum_A my code passed all the test cases. But when I submitted runningSum_B, it failed many test cases. Can someone please explain the logic behind this? Is there something inherently different between these two functions, because they seem the same to me? (Link to the question: https://www.hackerrank.com/challenges/crush/problem?h_l=interview&playlist_slugs%5B%5D=interview-preparation-kit&playlist_slugs%5B%5D=arrays)
def runningSum_A(arr):
maxSum = -math.inf
runningSum = 0
for i in arr:
runningSum += i
maxSum = max(maxSum, runningSum)
return maxSum
def runningSum_B(arr):
for i in range(1, len(arr)):
arr[i] += arr[i - 1]
return max(arr)
def arrayManipulation(n, queries):
arr = [0]*(n+2)
for query in queries:
arr[query[0]] += query[2]
arr[query[1]+1] -= query[2]
############################################################
#call runningSum_A or runningSum_B function here

Move Zeroes in Scala

I'm working on "Move Zeroes" of leetcode with scala. https://leetcode.com/problems/move-zeroes/description/
Given an array nums, write a function to move all 0's to the end of it while maintaining the relative order of the non-zero elements. You must do this in-place without making a copy of the array.
I have a solution which works well in IntelliJ but get the same Array with input while executing in Leetcode, also I'm not sure whether it is done in-place... Something wrong with my code ?
Thanks
def moveZeroes(nums: Array[Int]): Array[Int] = {
val lengthOrig = nums.length
val lengthFilfter = nums.filter(_ != 0).length
var numsWithoutZero = nums.filter(_ != 0)
var numZero = lengthOrig - lengthFilfter
while (numZero > 0){
numsWithoutZero = numsWithoutZero :+ 0
numZero = numZero - 1
}
numsWithoutZero
}
And one more thing: the template code given by leetcode returns Unit type but mine returns Array.
def moveZeroes(nums: Array[Int]): Unit = {
}
While I agree with #ayush, Leetcode is explicitly asking you to use mutable states. You need to update the input array so that it contains the changes. Also, they ask you to do that in a minimal number of operations.
So, while it is not idiomatic Scala code, I suggest you a solution allong these lines:
def moveZeroes(nums: Array[Int]): Unit = {
var i = 0
var lastNonZeroFoundAt = 0
while (i < nums.size) {
if(nums(i) != 0) {
nums(lastNonZeroFoundAt) = nums(i)
lastNonZeroFoundAt += 1
}
i += 1
i = lastNonZeroFoundAt
while(i < nums.size) {
nums(i) = 0
i += 1
}
}
As this is non-idomatic Scala, writing such code is not encouraged and thus, a little bit difficult to read. The C++ version that is shown in the solutions may actually be easier to read and help you to understand my code above:
void moveZeroes(vector<int>& nums) {
int lastNonZeroFoundAt = 0;
// If the current element is not 0, then we need to
// append it just in front of last non 0 element we found.
for (int i = 0; i < nums.size(); i++) {
if (nums[i] != 0) {
nums[lastNonZeroFoundAt++] = nums[i];
}
}
// After we have finished processing new elements,
// all the non-zero elements are already at beginning of array.
// We just need to fill remaining array with 0's.
for (int i = lastNonZeroFoundAt; i < nums.size(); i++) {
nums[i] = 0;
}
}
Your answer gives TLE (Time Limit Exceeded) error in LeetCode..I do not know what the criteria is for that to occur..However i see a lot of things in your code that are not perfect .
Pure functional programming discourages use of any mutable state and rather focuses on using val for everything.
I would try it this way --
def moveZeroes(nums: Array[Int]): Array[Int] = {
val nonZero = nums.filter(_ != 0)
val numZero = nums.length - nonZero.length
val zeros = Array.fill(numZero){0}
nonZero ++ zeros
}
P.S - This also gives TLE in Leetcode but still i guess in terms of being functional its better..Open for reviews though.

Assigning variables from a matrix in R

I think I'm incorrectly accessing and assigning variables from a matrix. I understand that arrays, matrices, and tables are different in R. What I want to end up with is an array of values called "c" that has either a 1 or 2 assigning an element from input to either Mew(number 1) or Mewtwo(number 2. I also want the distance from Mew to all other points in an array called dMew as well as dMewtwo an array of the distance from Mewtwo to all other elements in input. What I end up with is NA_real_ for all variables except input. There is a lot of great information on accessing rows or columns of various data structures in R but I'm interested in accessing single elements. Any advice would be most helpful. I apologize if this has been answered before but I couldn't find it anywhere.
#Read input from a csv data file
input = read.csv("~/Desktop/Engineering/iris.csv",head=FALSE)
input = input[c(0:3)]
input = as.matrix(input)
#set random centroids
Mew = input[1,1]
Mewtwo = input[nrow(input),ncol(input)]
#Determine Distance
dist <- function(x, y) {
return(sqrt((x - y)^2))
}
#Determine the clusters
dMew = matrix(,nrow(input), ncol(input))
dMewtwo = matrix(,nrow(input), ncol(input))
c = matrix(,nrow(input), ncol(input))
for (i in 1:nrow(input)) {
for (j in 1:ncol(input)) {
dMew[i,j] = dist(Mew, input[i,j])
dMewtwo[i,j] = dist(Mewtwo, input[i,j])
if (dMew[i,j] > dMewtwo[i,j]) {
c[i,j] = 2
} else {
c[i,j] = 1
}
}
}
#Update the centroids
Mew = mean(dMew)
Mewtwo = mean(dMewtwo)
I have no problem running the code with the following input:
input = data.frame(V1=1:5,V2=1:5,V3=1:5)
so it seems to be a problem related to your data. Also you should avoid using "c" as a variable name and note that dist() is already a function in the stats package. Also, you can avoid the for-loops by using apply() and ifelse():
#Read input from a csv data file
input = data.frame(V1=1:5,V2=1:5,V3=1:5)
input = input[c(1:3)]
input = as.matrix(input)
#set random centroids
Mew = input[1,1]
Mewtwo = input[nrow(input),ncol(input)]
#Determine Distance
dist.eu <- function(x, y) {
return(sqrt((x - y)^2))
}
dMew<-apply(input,c(1,2),dist.eu,Mew)
dMewtwo<-apply(input,c(1,2),dist.eu,Mewtwo)
c.mat<-ifelse(dMew > dMewtwo,2,1)
c.mat

Math::Complex screwing up my array references

I'm trying to optimize some code here, and wrote two different simple subroutines that will subtract one vector from another. I pass a pair of vectors to these subroutines and the subtraction is then performed. The first subroutine uses an intermediary variable to store the result whereas the second one does an inline operation using the '-=' operator. The full code is located at the bottom of this question.
When I use purely real numbers, the program works fine and there are no issues. However, if I am using complex operands, then the original vectors (the ones originally passed to the subroutines) are modified! Why does this program work fine for purely real numbers but do this sort of data modification when using complex numbers?
Note my process:
Generate random vectors (either real or complex depending on the commented out code)
Print the main vectors to the screen
Perform the first subroutine subtraction (using the third variable intermediary within the subroutine)
Print the main vectors to the screen again to prove that they have not changed, no matter the use of real or complex vectors
Perform the second subroutine subtraction (using the inline computation method)
Print the main vectors to the screen again, showing that #main_v1 has changed when using complex vectors, but will not change when using real vectors (#main_v2 is unaffected)
Print the final answers to the subtraction, which are always the correct answers, regardless of real or complex vectors
The issue arises because in the case of the second subroutine (which is quite a bit faster), I don't want the #main_v1 vector changed. I need that vector to do further calculations down the road, so I need it to stay the same.
Any idea on how to fix this, or what I'm doing wrong? My entire code is below, and should be functional. I've been using the CLI syntax shown below to run the program. I choose 5 just to keep everything easy for me to read.
c:\> bench.pl 5 REAL
or
c:\> bench.pl 5 IMAG
#!/usr/local/bin/perl
# when debugging: add -w option above
#
use strict;
use warnings;
use Benchmark qw (:all);
use Math::Complex;
use Math::Trig;
use Time::HiRes qw (gettimeofday);
system('cls');
my $dimension = $ARGV[0];
my $type = $ARGV[1];
if(!$dimension || !$type){
print "bench.pl <n> <REAL | IMAG>\n";
print " <n> indicates the dimension of the vector to generate\n";
print " <REAL | IMAG> dictates to use real or complex vectors\n";
exit(0);
}
my #main_v1;
my #main_v2;
my #vector_sum1;
my #vector_sum2;
for($a=1;$a<=$dimension;$a++){
my $r1 = sprintf("%.0f", 9*rand)+1;
my $r2 = sprintf("%.0f", 9*rand)+1;
my $i1 = sprintf("%.0f", 9*rand)+1;
my $i2 = sprintf("%.0f", 9*rand)+1;
if(uc($type) eq "IMAG"){
# Using complex vectors has the issue
$main_v1[$a] = cplx($r1,$i1);
$main_v2[$a] = cplx($r2,$i2);
}elsif(uc($type) eq "REAL"){
# Using real vectors shows no issue
$main_v1[$a] = $r1;
$main_v2[$a] = $r2;
}else {
print "bench.pl <n> <REAL | IMAG>\n";
print " <n> indicates the dimension of the vector to generate\n";
print " <REAL | IMAG> dictates to use real or complex vectors\n";
exit(0);
}
}
# cmpthese(-5, {
# v1 => sub {#vector_sum1 = vector_subtract(\#main_v1, \#main_v2)},
# v2 => sub {#vector_sum2 = vector_subtract_v2(\#main_v1, \#main_v2)},
# });
# print "\n";
print "main vectors as defined initially\n";
print_vector_matlab(#main_v1);
print_vector_matlab(#main_v2);
print "\n";
#vector_sum1 = vector_subtract(\#main_v1, \#main_v2);
print "main vectors after the subtraction using 3rd variable\n";
print_vector_matlab(#main_v1);
print_vector_matlab(#main_v2);
print "\n";
#vector_sum2 = vector_subtract_v2(\#main_v1, \#main_v2);
print "main vectors after the inline subtraction\n";
print_vector_matlab(#main_v1);
print_vector_matlab(#main_v2);
print "\n";
print "subtracted vectors from both subroutines\n";
print_vector_matlab(#vector_sum1);
print_vector_matlab(#vector_sum2);
sub vector_subtract {
# subroutine to subtract one [n x 1] vector from another
# result = vector1 - vector2
#
my #vector1 = #{$_[0]};
my #vector2 = #{$_[1]};
my #result;
my $row = 0;
my $dim1 = #vector1 - 1;
my $dim2 = #vector2 - 1;
if($dim1 != $dim2){
syswrite STDOUT, "ERROR: attempting to subtract vectors of mismatched dimensions\n";
exit;
}
for($row=1;$row<=$dim1;$row++){$result[$row] = $vector1[$row] - $vector2[$row]}
return(#result);
}
sub vector_subtract_v2 {
# subroutine to subtract one [n x 1] vector from another
# implements the inline subtraction method for alleged speedup
# result = vector1 - vector2
#
my #vector1 = #{$_[0]};
my #vector2 = #{$_[1]};
my $row = 0;
my $dim1 = #vector1 - 1;
my $dim2 = #vector2 - 1;
if($dim1 != $dim2){
syswrite STDOUT, "ERROR: attempting to subtract vectors of mismatched dimensions\n";
exit;
}
for($row=1;$row<=$dim1;$row++){$vector1[$row] -= $vector2[$row]} # subtract inline
return(#vector1);
}
sub print_vector_matlab { # for use with outputting square matrices only
my (#junk) = (#_);
my $dimension = #junk - 1;
print "V=[";
for($b=1;$b<=$dimension;$b++){
# $temp_real = sprintf("%.3f", Re($junk[$b][$c]));
# $temp_imag = sprintf("%.3f", Im($junk[$b][$c]));
# $temp_cplx = cplx($temp_real,$temp_imag);
print "$junk[$b];";
# print "$temp_cplx,";
}
print "];\n";
}
I've even tried modifying the second subroutine so that it has the following lines, and it STILL alters the #main_v1 vector when using complex numbers...I am completely confused as to what's going on.
#result = #vector1;
for($row=1;$row<=$dim1;$row++){$result[$row] -= $vector2[$row]}
return(#result);
and I've tried this too...still modifies #main_V1 with complex numbers
for($row-1;$row<=$dim1;$row++){$result[$row] = $vector1[$row]}
for($row=1;$row<=$dim1;$row++){$result[$row] -= $vector2[$row]}
return(#result);
Upgrade Math::Complex to at least version 1.57. As the changelog explains, one of the changes in that version was:
Add copy constructor and arrange for it to be called appropriately, problem found by David Madore and Alexandr Ciornii.
In Perl, an object is a blessed reference; so an array of Math::Complexes is an array of references. This is not true of real numbers, which are just ordinary scalars.
If you change this:
$vector1[$row] -= $vector2[$row]
to this:
$vector1[$row] = $vector1[$row] - $vector2[$row]
you'll be good to go: that will set $vector1[$row] to refer to a new object, rather than modifying the existing one.

R - Loop in matrix

I have two variables, the first is 1D flow vector containing 230 data and the second is 2D temperature matrix (230*44219).
I am trying to find the correlation matrix between each flow value and corresponding 44219 temperature. This is my code below.
Houlgrave_flow_1981_2000 = window(Houlgrave_flow_average, start = as.Date("1981-11-15"),end = as.Date("2000-12-15"))
> str(Houlgrave_flow_1981_2000)
‘zoo’ series from 1981-11-15 to 2000-12-15
Data: num [1:230] 0.085689 0.021437 0.000705 0 0.006969 ...
Index: Date[1:230], format: "1981-11-15" "1981-12-15" "1982-01-15" "1982-02-15" ...
Hulgrave_SST_1981_2000=X_sst[1:230,]
> str(Hulgrave_SST_1981_2000)
num [1:230, 1:44219] -0.0733 0.432 0.2783 -0.1989 0.1028 ...
sf_Houlgrave_SF_SST = NULL
sst_Houlgrave_SF_SST = NULL
cor_Houlgrave_SF_SST = NULL
for (i in 1:230) {
for(j in 1:44219){
sf_Houlgrave_SF_SST[i] = Houlgrave_flow_1981_2000[i]
sst_Houlgrave_SF_SST[i,j] = Hulgrave_SST_1981_2000[i,j]
cor_Houlgrave_SF_SST[i,j] = cor(sf_Houlgrave_SF_SST[i],Hulgrave_SST_1981_2000[i,j])
}
}
The error message always says:
Error in sst_Houlgrave_SF_SST[i, j] = Hulgrave_SST_1981_2000[i, j] :
incorrect number of subscripts on matrix
Thank you for your help.
try this:
# prepare empty matrix of correct size
cor_Houlgrave_SF_SST <- matrix(nrow=dim(Hulgrave_SST_1981_2000)[1],
ncol=dim(Hulgrave_SST_1981_2000)[2])
# Good practice to not specify "230" or "44219" directly, instead
for (i in 1:dim(Hulgrave_SST_1981_2000)[1]) {
for(j in 1:dim(Hulgrave_SST_1981_2000)[2]){
cor_Houlgrave_SF_SST[i,j] <- cor(sf_Houlgrave_SF_SST[i],Hulgrave_SST_1981_2000[i,j])
}
}
The two redefinitions inside your loop were superfluous I believe. The main problem with your code was not defining the matrix - i.e. the cor variable did not have 2 dimensions, hence the error.
It is apparently also good practice to define empty matrices for results in for-loops by explicitly giving them correct dimensions in advance - is meant to make the code more efficient.

Resources