I'm attempting to generate the Laguerre polynomials, and then evaluate them elementwise over a coordinate array.
Presently my code looks something like:
[X,Y] = meshgrid(x_min:mesh_size:x_max,y_min:mesh_size:y_max);
const_p=0;
const_l=1; %At present these two values don't really matter, any integer will do
coord_r = sqrt(X.^2 + Y.^2)
lag_input = num2str(coord_r.^2)
u_pl = evalin(symengine,['orthpoly::laguerre(',num2str(const_p),',',num2str(const_l),',',lag_input,')']);
However, that returns the following error for the last line;
Error using horzcat
Dimensions of matrices being concatenated are not consistent.
I assumed that this was because the three objects being converted to strings had different sizes, but after making them the same size the problem persists.
I'd rather avoid looping through each element if I can avoid it.
I would go about this slightly differently. How about the below? Note that I changed const_p and const_l from your choices because the resulting Laguerre Polynomial is spectacularly dull otherwise.
const_p = 2;
const_l = 1;
%generate the symbolic polynomial in x
lagpoly=feval(symengine,'orthpoly::laguerre',const_p,const_l,'x');
%Find the polynomical coefficients so we can evaluate using MATLAB's poly
coeff=double(feval(symengine,'coeff',lagpoly));
%generate a matrix the same size as coord_r in the original question
x=rand(512);
%Do the evaluation
u_pl=polyval(coeff,x);
#WalkingRandomly has the best way to do this if you need fast numeric results, which is usually the case. However, if you need exact analytical values, there is a trick that at you can use to avoid a for loop: MuPAD's map function. This is how almost all MuPAD functions must be vectorized as they're usually designed for scalar symbolic variables rather than arrays of numeric values. Here's a basic example:
const_p = 2;
const_l = 1;
mesh_size = 0.2;
x_min = 0;
x_max = 1;
y_min = 0;
y_max = 1;
[X,Y] = meshgrid(x_min:mesh_size:x_max,y_min:mesh_size:y_max);
coord_r = sqrt(X.^2 + Y.^2);
lagpoly = evalin(symengine,['map(' char(sym(coord_r)) ...
',x->orthpoly::laguerre(' char(sym(const_p)) ...
',' char(sym(const_l)) ',x))'])
which returns
lagpoly =
[ 3, 121/50, 47/25, 69/50, 23/25, 1/2]
[ 121/50, 76/25 - (3*2^(1/2))/5, 31/10 - (3*5^(1/2))/5, 16/5 - (3*2^(1/2)*5^(1/2))/5, 167/50 - (3*17^(1/2))/5, 88/25 - (3*26^(1/2))/5]
[ 47/25, 31/10 - (3*5^(1/2))/5, 79/25 - (6*2^(1/2))/5, 163/50 - (3*13^(1/2))/5, 17/5 - (6*5^(1/2))/5, 179/50 - (3*29^(1/2))/5]
[ 69/50, 16/5 - (3*2^(1/2)*5^(1/2))/5, 163/50 - (3*13^(1/2))/5, 84/25 - (9*2^(1/2))/5, 1/2, 92/25 - (3*34^(1/2))/5]
[ 23/25, 167/50 - (3*17^(1/2))/5, 17/5 - (6*5^(1/2))/5, 1/2, 91/25 - (12*2^(1/2))/5, 191/50 - (3*41^(1/2))/5]
[ 1/2, 88/25 - (3*26^(1/2))/5, 179/50 - (3*29^(1/2))/5, 92/25 - (3*34^(1/2))/5, 191/50 - (3*41^(1/2))/5, 4 - 3*2^(1/2)]
Calling double(lagpoly) will convert the result to floating point and you'll see that this is the same as the solution provided by ##WalkingRandomly (given the same inputs). Of course you could probably use the symbolic polynomial or its coefficients to find the same thing manually, though it's unfortunate that polyval isn't overloaded for class sym (there's evalp but it's also not vectorized so it would need to be used in conjunction with map).
Related
According to the available documentation for Chapel, (whole-)array
statements like
A = B + alpha * C; // with A, B, and C being arrays, and alpha some scalar
are implemented in the language as the following forall iteration:
forall (a,b,c) in zip(A,B,C) do
a = b + alpha * c;
Thus, array statements are by default executed by a team of parallel
threads. Unfortunately, this also seems to completely preclude the
(either partial or complete) vectorization of such statements. This
can lead to performance surprises for programmers who are used to languages like Fortran or Python/Numpy (where the default behavior typically is to have array statements be only vectorized).
For codes that use (whole-)array statements with arrays of small to
moderate size, the loss of vectorization (confirmed by Linux hardware
performance counters) and the significant overhead inherent to
parallel threads (which are unsuited to effectively exploit the
fine-grained data-parallelism available in such problems) can result
in significant loss of performance. As an example consider the
following versions of Jacobi iteration that all solve the same problem
on a domain of 300 x 300 zones:
Jacobi_1 employs array-statements, as follows:
/*
* Jacobi_1
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (whole-)array statements.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
X[n+1,1..n] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
XNew[1..n,1..n] =
( X[0..n-1,1..n] + X[2..n+1,1..n] +
X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;
// update X with next approximation
X[1..n,1..n] = XNew[1..n,1..n];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Jacobi_2 employs serial for-loops throughout (i.e. only (auto-)vectorization
by the back-end C-compiler is allowed):
/*
* Jacobi_2
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using (serial) for-loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
for j in 1..n do
X[n+1,j] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
for i in 1..n do
for j in 1..n do
XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
X[i,j+1] + X[i,j-1] ) / 4.0;
// update X with next approximation
for i in 1..n do
for j in 1..n do
X[i,j] = XNew[i,j];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Jacobi_3, finally, has the innermost loops vectorized and only the
outermost loops threaded:
/*
* Jacobi_3
*
* This program (adapted from the Chapel distribution) performs
* niter iterations of the Jacobi method for the Laplace equation
* using both parallel and serial (vectorized) loops.
*
*/
config var n = 300; // size of n x n grid
config var niter = 10000; // number of iterations to perform
proc main() {
const Domain = {0..n+1,0..n+1}; // domain including boundary points
var iteration = 0; // iteration counter
var X, XNew: [Domain] real = 0.0; // declare arrays:
// X stores approximate solution
// XNew stores the next solution
for j in vectorizeOnly(1..n) do
X[n+1,j] = 1.0; // Set south boundary values to 1.0
do {
// compute next approximation
forall i in 1..n do
for j in vectorizeOnly(1..n) do
XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
X[i,j+1] + X[i,j-1] ) / 4.0;
// update X with next approximation
forall i in 1..n do
for j in vectorizeOnly(1..n) do
X[i,j] = XNew[i,j];
// advance iteration counter
iteration += 1;
} while (iteration < niter);
writeln("Jacobi computation complete.");
writeln("# of iterations: ", iteration);
} // main
Running these codes on a laptop with 2 processor-cores and using two
parallel threads, one finds that Jacobi_1 is (surprisingly)
more than ten times slower than Jacobi_2, which itself is (expectedly)
~1.6 times slower than Jacobi_3.
Unfortunately, this default behavior makes array statements completely
unattractive for my use cases, even for algorithms which would benefit
enormously from the more concise notation, and readability that
(whole-)array statements can provide.
Are there ways for the user in Chapel to change this default behavior?
That is, can a user customize the default parallelization of whole-array
statements in a way that such array statements, as used in Jacobi_1, will
behave either like the code in Jacobi_2 (which would be useful for code development and debugging purposes), or the code in Jacobi_3 (which, among those three, would be the method of choice for production calculations)?
I have tried to achieve this by plugging calls to "vectorizeOnly()" into
the definition of "Domain" above, but to no avail.
Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet that capability is not well-supported today, as you note (even the vectorizeOnly() iterator that you're using is only considered prototypical).
I'll mention that we tend to see better vectorization results when using Chapel's LLVM back-end than we do with the (default) C back-end, and that we've seen even better results when utilizing Simon Moll's LLVM-based Region Vectorizer (Saarland University). But we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary. But if you care about vectorization, it's worth a try.
To your specific question:
Are there ways for the user in Chapel to change this default behavior?
There are. For explicit forall loops, you can write your own parallel iterator which can be used to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) to govern how loops over a given array are implemented by default (i.e., if no iterator is explicitly invoked). This permits end-users to specify different implementation policies for a Chapel array than the ones we support by default.
With respect to your three code variants, I'm noting that the first uses multidimensional zippering which is known to have significant performance problems today. This is the likely main cause of performance differences between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then used +/-1 indexing per-dimension, you'd see a significant improvement (and, I'd guess, performance that's much more comparable to the third case).
For the third, I'd be curious whether the benefits you're seeing are due to vectorization or simply due to multitasking since you've avoided the performance problem of the first and the serial implementation of the second. E.g., have you checked to see whether using the vectorizeOnly() iterator added any performance improvement over the same code without that iterator (or used tools on the binary files to inspect whether vectorization is occurring?)
In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for best vectorization results, you might try the LLVM back-end.
So I'm trying to write a function to generate Hermite polynomials and it's doing something super crazy ... Why does it generate different elements for h when I start with a different n? So inputting Hpoly(2,1) gives
h = [ 1, 2*y, 4*y^2 - 2]
while for Hpoly(3,1) ,
h = [ 1, 2*y, 4*y^2 - 4, 2*y*(4*y^2 - 4) - 8*y]
( (4y^2 - 2) vs (4y^2 - 4) as a third element here )
also, I can't figure out how to actually evaluate the expression. I tried out = subs(h(np1),y,x) but that did nothing.
code:
function out = Hpoly(n, x)
clc;
syms y
np1 = n + 1;
h = [1, 2*y];
f(np1)
function f(np1)
if numel(h) < np1
f(np1 - 1)
h(np1) = 2*y*h(np1-1) - 2*(n-1)*h(np1-2);
end
end
h
y = x;
out = h(np1);
end
-------------------------- EDIT ----------------------------
So I got around that by using a while loop instead. I wonder why the other way didn't work ... (and still can't figure out how to evaluate the expression other than just plug in x from the very beginning ... I suppose that's not that important, but would still be nice to know...)
Sadly, my code isn't as fast as hermiteH :( I wonder why.
function out = Hpoly(n, x)
h = [1, 2*x];
np1 = n + 1;
while np1 > length(h)
h(end+1) = 2*x*h(end) - 2*(length(h)-1)*h(end-1);
end
out = h(end)
end
Why is your code slower? Recursion is not necessarily of Matlab's fortes so you may have improved it by using a recurrence relation. However, hermiteH is written in C and your loop won't be as fast as it could be because you're using a while instead of for and needlessly reallocating memory instead of preallocating it. hermiteH may even use a lookup table for the first coefficients or it might benefit from vectorization using the explicit expression. I might rewrite your function like this:
function h = Hpoly(n,x)
% n - Increasing sequence of integers starting at zero
% x - Point at which to evaluate polynomial, numeric or symbolic value
mx = max(n);
h = cast(zeros(1,mx),class(x)); % Use zeros(1,mx,'like',x) in newer versions of Matlab
h(1) = 1;
if mx > 0
h(2) = 2*x;
end
for i = 2:length(n)-1
h(i+1) = 2*x*h(i)-2*(i-1)*h(i-1);
end
You can then call it with
syms x;
deg = 3;
h = Hpoly(0:deg,x)
which returns [ 1, 2*x, 4*x^2 - 2, 2*x*(4*x^2 - 2) - 8*x] (use expand on the output if you want). Unfortunately, this won't be much faster if x is symbolic.
If you're only interested in numeric results of the the polynomial evaluated at particular values, then it's best to avoid symbolic math altogether. The function above valued for double precision x will be three to four orders of magnitude faster than for symbolic x. For example:
x = pi;
deg = 3;
h = Hpoly(0:deg,x)
yields
h =
1.0e+02 *
0.010000000000000 0.062831853071796 0.374784176043574 2.103511015993210
Note:
The hermiteH function is R2015a+, but assuming that you still have access to the Symbolic Math toolbox and the Matlab version is R2012b+, you can also try calling MuPAD's orthpoly::hermite. hermiteH used this function under the hood. See here for details on how to call MuPAD functions from Matlab. This function is a bit simpler in that it only returns a single term. Using a for loop:
syms x;
deg = 2;
h = sym(zeros(1,deg+1));
for i = 1:deg+1
h(i) = feval(symengine,'orthpoly::hermite',i-1,x);
end
Alternatively, you can use map to vectorize the above:
deg = 2;
h = feval(symengine,'map',0:deg,'n->orthpoly::hermite(n,x)');
Both return [ 1, 2*x, 4*x^2 - 2].
OP UPDATE: Note that in the latest version of Julia (v0.5), the idiomatic approach to answering this question is to just define mysquare(x::Number) = x^2. The vectorised case is covered using automatic broadcasting, i.e. x = randn(5) ; mysquare.(x). See also the new answer explaining dot syntax in more detail.
I am new to Julia, and given my Matlab origins, I am having some difficulty determining how to write "good" Julia code that takes advantage of multiple dispatch and Julia's type system.
Consider the case where I have a function that provides the square of a Float64. I might write this as:
function mysquare(x::Float64)
return(x^2);
end
Sometimes, I want to square all the Float64s in a one-dimentional array, but don't want to write out a loop over mysquare everytime, so I use multiple dispatch and add the following:
function mysquare(x::Array{Float64, 1})
y = Array(Float64, length(x));
for k = 1:length(x)
y[k] = x[k]^2;
end
return(y);
end
But now I am sometimes working with Int64, so I write out two more functions that take advantage of multiple dispatch:
function mysquare(x::Int64)
return(x^2);
end
function mysquare(x::Array{Int64, 1})
y = Array(Float64, length(x));
for k = 1:length(x)
y[k] = x[k]^2;
end
return(y);
end
Is this right? Or is there a more ideomatic way to deal with this situation? Should I use type parameters like this?
function mysquare{T<:Number}(x::T)
return(x^2);
end
function mysquare{T<:Number}(x::Array{T, 1})
y = Array(Float64, length(x));
for k = 1:length(x)
y[k] = x[k]^2;
end
return(y);
end
This feels sensible, but will my code run as quickly as the case where I avoid parametric types?
In summary, there are two parts to my question:
If fast code is important to me, should I use parametric types as described above, or should I write out multiple versions for different concrete types? Or should I do something else entirely?
When I want a function that operates on arrays as well as scalars, is it good practice to write two versions of the function, one for the scalar, and one for the array? Or should I be doing something else entirely?
Finally, please point out any other issues you can think of in the code above as my ultimate goal here is to write good Julia code.
Julia compiles a specific version of your function for each set of inputs as required. Thus to answer part 1, there is no performance difference. The parametric way is the way to go.
As for part 2, it might be a good idea in some cases to write a separate version (sometimes for performance reasons, e.g., to avoid a copy). In your case however you can use the in-built macro #vectorize_1arg to automatically generate the array version, e.g.:
function mysquare{T<:Number}(x::T)
return(x^2)
end
#vectorize_1arg Number mysquare
println(mysquare([1,2,3]))
As for general style, don't use semicolons, and mysquare(x::Number) = x^2 is a lot shorter.
As for your vectorized mysquare, consider the case where T is a BigFloat. Your output array, however, is Float64. One way to handle this would be to change it to
function mysquare{T<:Number}(x::Array{T,1})
n = length(x)
y = Array(T, n)
for k = 1:n
#inbounds y[k] = x[k]^2
end
return y
end
where I've added the #inbounds macro to boost speed because we don't need to check the bound violation every time — we know the lengths. This function could still have issues in the event that the type of x[k]^2 isn't T. An even more defensive version would perhaps be
function mysquare{T<:Number}(x::Array{T,1})
n = length(x)
y = Array(typeof(one(T)^2), n)
for k = 1:n
#inbounds y[k] = x[k]^2
end
return y
end
where one(T) would give 1 if T is an Int, and 1.0 if T is a Float64, and so on. These considerations only matter if you want to make hyper-robust library code. If you really only will be dealing with Float64s or things that can be promoted to Float64s, then it isn't an issue. It seems like hard work, but the power is amazing. You can always just settle for Python-like performance and disregard all type information.
As of Julia 0.6 (c. June 2017), the "dot syntax" provides an easy and idiomatic way to apply a function to a scalar or an array.
You only need to provide the scalar version of the function, written in the normal way.
function mysquare{x::Number)
return(x^2)
end
Append a . to the function name (or preprend it to the operator) to call it on every element of an array:
x = [1 2 3 4]
x2 = mysquare(2) # 4
xs = mysquare.(x) # [1,4,9,16]
xs = mysquare.(x*x') # [1 4 9 16; 4 16 36 64; 9 36 81 144; 16 64 144 256]
y = x .+ 1 # [2 3 4 5]
Note that the dot-call will handle broadcasting, as in the last example.
If you have multiple dot-calls in the same expression, they will be fused so that y = sqrt.(sin.(x)) makes a single pass/allocation, instead of creating a temporary expression containing sin(x) and forwarding it to the sqrt() function. (This is different from Matlab/Numpy/Octave/Python/R, which don't make such a guarantee).
The macro #. vectorizes everything on a line, so #. y=sqrt(sin(x)) is the same as y = sqrt.(sin.(x)). This is particularly handy with polynomials, where the repeated dots can be confusing...
I have an array of doubles which is the result of the FFT applied on an array, that contains the audio data of a Wav audio file in which i have added a 1000Hz tone.
I obtained this array thought the DREALFT defined in "Numerical Recipes".(I must use it).
(The original array has a length that is power of two.)
Mine array has this structure:
array[0] = first real valued component of the complex transform
array[1] = last real valued component of the complex transform
array[2] = real part of the second element
array[3] = imaginary part of the second element
etc......
Now, i know that this array represent the frequency domain.
I want to determine and kill the 1000Hz frequency.
I have tried this formula for finding the index of the array which should contain the 1000Hz frequency:
index = 1000. * NElements /44100;
Also, since I assume that this index refers to an array with real values only, i have determined the correct(?) position in my array, that contains imaginary values too:
int correctIndex=2;
for(k=0;k<index;k++){
correctIndex+=2;
}
(I know that surely there is a way easier but it is the first that came to mind)
Then, i find this value: 16275892957.123705, which i suppose to be the real part of the 1000Hz frequency.(Sorry if this is an imprecise affermation but at the moment I do not care to know more about it)
So i have tried to suppress it:
array[index]=-copy[index]*0.1f;
I don't know exactly why i used this formula but is the only one that gives some results, in fact the 1000hz tone appears to decrease slightly.
This is the part of the code in question:
double *copy = malloc( nCampioni * sizeof(double));
int nSamples;
/*...Fill copy with audio data...*/
/*...Apply ZERO PADDING and reach the length of 8388608 samples,
or rather 8388608 double values...*/
/*Apply the FFT (Sure this works)*/
drealft(copy - 1, nSamples, 1);
/*I determine the REAL(?) array index*/
i= 1000. * nSamples /44100;
/*I determine MINE(?) array index*/
int j=2;
for(k=0;k<i;k++){
j+=2;
}
/*I reduce the array value, AND some other values aroud it as an attempt*/
for(i=-12;i<12;i+=2){
copy[j-i]=-copy[i-j]*0.1f;
printf("%d\n",j-i);
}
/*Apply the inverse FFT*/
drealft(copy - 1, nSamples, -1);
/*...Write the audio data on the file...*/
NOTE: for simplicity I omitted the part where I get an array of double from an array of int16_t
How can i determine and totally kill the 1000Hz frequency?
Thank you!
As Oli Charlesworth writes, because your target frequency is not exactly one of the FFT bins (your index, TargetFrequency * NumberOfElements / SamplingRate, is not exactly an integer), the energy of the target frequency will be spread across all bins. For a start, you can eliminate some of the frequency by zeroing the bin closest to the target frequency. This will of course affect other frequencies somewhat too, since it is slightly off target. To better suppress the target frequency, you will need to consider a more sophisticated filter.
However, for educational purposes: To suppress the frequency corresponding to a bin, simply set that bin to zero. You must set both the real and the imaginary components of the bin to zero, which you can do with:
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 1;
Some notes about this:
You had this code to calculate the position in the array:
int correctIndex = 2;
for (k = 0; k < index; k++) {
correctIndex += 2;
}
That is equivalent to:
correctIndex = 2*(index+1);
I believe you want 2*index, not 2*(index+1). So you were likely reducing the wrong bin.
At one point in your question, you wrote array[index] = -copy[index]*0.1f;. I do not know what array is. You appeared to be working in place in copy. I also do not know why you multiplied by 1/10. If you want to eliminate a frequency, just set it to zero. Multiplying it by 1/10 only reduces it to 10% of its original magnitude.
I understand that you must pass copy-1 to drealft because the Numerical Recipes code uses one-based indexing. However, the C standard does not support the way you are doing it. The behavior of the expression copy-1 is not defined by the standard. It will work in most C implementations. However, to write supported portable code, you should do this instead:
// Allocate one extra element.
double *memory = malloc((nCampioni+1) * sizeof *memory);
// Make a pointer that is convenient for your work.
double *copy = memory+1;
…
// Pass the necessary base address to drealft.
drealft(memory, nSamples, 1);
// Suppress a frequency.
copy[index*2 + 0] = 0;
copy[index*2 + 1] = 0;
…
// Free the memory.
free(memory);
One experiment I suggest you consider is to initialize an array with just a sine wave at the desired frequency:
for (i = 0; i < nSamples; ++i)
copy[i] = sin(TwoPi * Frequency / SampleRate * i);
(TwoPi is of course 2*3.1415926535897932384626433.) Then apply drealft and look at the results. You will see that much of the energy is at a peak in the closest bin to the target frequency, but much of it has also spread to other bins. Clearly, zeroing a single bin and performing the inverse FFT cannot eliminate all of the frequency. Also, you should see that the peak is in the same bin you calculated for index. If it is not, something is wrong.
I am trying to make an array of 3 floats in Actionscript 1.0, but instead of incrementing the X & Y variables by 1, it just adds 1 to the end of the previous value. This has nothing to do with Flash, it is being used for an extension for a server that requires extensions in Actionscript 1.0.
var uVars = [];
uVars.X = 250;
uVars.Y = 3;
uVars.Z = 250;
uVars.X += 1;
uVars.Y += 0;
uVars.Z += 1;
trace(uVars.X);
Show us your output, please. I assume you mean "X & Z," not "X & Y."
I don't have AS1 handy, but I'm going to make a bet that the numbers are being treated as strings, since ECMAScript uses "+" for both addition and string concatenation. You need to find some AS1 way to make sure the interpreter knows you are talking about numbers, not strings.
What happens if you put a .0 after all your numbers? e.g. 250.0?
Note: I just looked it up, and parseInt and parseFloat have been available since AS1.
parseInt
parseFloat