How to optimize valve simulation logic? [closed]

This is a simple logic-programming and optimisation etude that I've created for myself, and I've somewhat stumbled on it.
I have a numerical simulation of a simple scheme. Consider some reservoir (or a capacitor) Cm, which is constantly being pumped up with pressure. Let's call its current state Vm:
At its output it has a valve, or a gate, G, that can be open or closed, according to the following logic:
The gate opens when the pressure (or voltage) Vm exceeds some threshold, call it Vopen: Vm > Vopen
The gate remains open while the outrush current Ia is greater than some Ihold: Ia > Ihold
The gate conducts power only out of the reservoir (like a diode)
I am doing numerical ODE solving of this, i.e. determining Vm and Ia at each (equal, small) timestep dt. I have three variants of this:
Variable types:
float Vm=0.0, Ia=0.0, Vopen, Ihold, Ra, dt;
int G=0;
Loop body v1 (serial):
Vm = Vm + (-Ia*G)*dt;
G |= (Vm > Vopen);
Ia = Ia + (Vm*Ra*G)*dt;
G &= (Ia > Ihold);
Loop body v2 (serial, with temp var, ternary conditionals):
int Gv; // temporary var
Vm = Vm + (-Ia*G)*dt;
Gv = (Vm > Vopen) ? 1 : G;
Ia = Ia + (Vm*Ra*Gv)*dt;
G = (Ia > Ihold) ? Gv : 0;
Loop body v3 (parallel, with cache):
// cache new state first
float Vm1 = Vm + (-Ia*G)*dt;
float Ia1 = Ia + (Vm*Ra*G)*dt;
// update memory
G = ( Vm1 > Vopen ) || (( Ia1 > Ihold ) && G);
Vm = Vm1;
Ia = Ia1; // not necessary to cache; done for readability
The expression for G was derived by building up a truth table (not reproduced here), plus imagination.
Q:
Which one is correct? Are all of them?
How does the third variant (parallel logic) differ from the first two (serial logic)?
Are there more effective ways of doing this logic?
PS. I am trying to optimise this for SSE, and then (separately) for OpenCL, in case that gives optimisation hints.
PPS. For those who are curious, here is my working simulator involving this gate (HTML/JS).

Going by the overall description they are the same, and all of them should fulfil your needs.
Your serial code will produce half-steps. That means that, broken down to the discrete description, V(t) can be described as being 1/2 dt ahead of I(t). The first variant keeps G changing at every half-step, while the second one synchronises it to I. But since you are not evaluating G in between, it doesn't really matter. There is also no real problem with V and I being half a step apart, though you should keep it in mind; for plotting/evaluation you might want to use the vector {V(t), (I(t-1)+I(t))/2, G(t)}.
The parallel code will keep them all in the same time step.
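For illustration, here is a minimal sketch of variant 3 inside a complete time-stepping loop (nsteps and the constants are assumed to be defined elsewhere; the term that pumps the reservoir up is omitted, exactly as in the snippets above):
float Vm = 0.0f, Ia = 0.0f;
int G = 0;
for (int step = 0; step < nsteps; step++) {
    float Vm1 = Vm + (-Ia * G) * dt;      // discharge through the gate
    float Ia1 = Ia + (Vm * Ra * G) * dt;  // outrush current while the gate conducts
    G  = (Vm1 > Vopen) || ((Ia1 > Ihold) && G);
    Vm = Vm1;                             // all three states now refer to the same time level
    Ia = Ia1;
}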
For your purely linear problem, direct integration is a good solution. A higher-order ODE solver won't buy you anything. A state-space representation of a purely linear system only writes the same direct integration in a different way.
For SIMD optimisation there isn't much to do. You need to evaluate step by step, since you are updating I from V and V from I. That means you can't run the steps in parallel, which rules out many interesting optimisations.

Related

Solving a simple puzzle in Prolog

I solved a puzzle in C and tried to do the same in Prolog, but I'm having some trouble expressing the facts and goals in this language.
The very simplified version of the problem is this: there are two levers in a room. Each lever controls a mechanism that can move either forward or backward through four different positions (which I number 0, 1, 2 and 3). If you move a mechanism four times in the same direction, it ends up in the same position as before.
Lever n°1 moves mechanism n°1 two positions forward.
Lever n°2 moves mechanism n°2 one position forward.
Initially, mechanism n°1 is in position 2 and mechanism n°2 is in position 1.
The problem is to find the quickest way to move both mechanisms to position 0 and to get the sequence of lever pulls that leads to each solution.
Of course, here the problem is trivial and you only need to pull lever n°1 one time and lever n°2 three times to reach a solution.
Here's some simple C code which prints the sequences of levers to pull that solve this problem in fewer than 5 pulls:
#include <stdio.h>

int pos1 = 2, pos2 = 1;

void lever1(){
    pos1 = (pos1 + 2) % 4;
}
void undolever1(){
    pos1 = (pos1 + 2) % 4; /* +2 == -2 (mod 4); avoids a negative intermediate value */
}
void lever2(){
    pos2 = (pos2 + 1) % 4;
}
void undolever2(){
    pos2 = (pos2 + 3) % 4; /* +3 == -1 (mod 4); avoids a negative intermediate value */
}
void resolve(int l, int k){
    if(k == 0){
        return;
    }
    if(pos1 == 0 && pos2 == 0){
        printf("Solution: %d\n", l);
        return;
    }
    if(k > 0){
        k--;
        lever1();
        resolve(l*10+1, k);
        undolever1();
        lever2();
        resolve(l*10+2, k);
        undolever2();
    }
}
int main(void)
{
    resolve(0, 5);
    return 0;
}
My code in Prolog looks like this so far:
lever(l1).
lever(l2).
mechanism(m1).
mechanism(m2).
position(m1,2).
position(m2,1).
pullL1() :- position(m1, mod(position(m1,X)+2,4)).
pullL2() :- position(m2, mod(position(m2,X)+1,4)).
solve(k) :- solve_(k, []).
solve_(0, r) :- !, postion(m1, p1), postion(m2, p2), p1 == 0, p2 == 0.
solve_(k, r) :- k > 0, pullL1(), k1 is k - 1, append(r, [1], r1), solve_(k1, r1).
solve_(k, r) :- k > 0, pullL2(), k1 is k - 1, append(r, [2], r2), solve_(k1, r2).
I'm pretty sure there are multiple problems in this code, but I'm not sure how to fix them.
Any help would be really appreciated.
I think this is a very interesting problem. I suppose you want a general solution, where one lever can move multiple mechanisms. In a case like yours, where each lever controls only one mechanism, the solution is trivial: you just pull every lever until its mechanism is back at state zero.
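As a hypothetical illustration of that trivial case (this helper is my own, not part of the original answer), counting the pulls for one lever/mechanism pair in C could look like this:
/* Smallest number of pulls x such that (bias + x*step) % states == 0;
 * returns -1 if this lever can never bring its mechanism back to zero. */
int pulls_needed(int bias, int step, int states) {
    for (int x = 0; x < states; x++)
        if ((bias + x * step) % states == 0)
            return x;
    return -1;
}
/* pulls_needed(2, 2, 4) == 1 for lever 1, pulls_needed(1, 1, 4) == 3 for lever 2 */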
But I want to provide a more general solution, where one lever can move multiple mechanisms. First a little bit of math; don't worry, I'll end up doing an example too.
Let's define L_1, ..., L_n as being the n levers and M_1, ..., M_m being the m mechanisms. Then let's describe every lever L_i by a vector
l_i = (l_i1, ..., l_im)
where l_ij is the amount of steps L_i moves mechanism M_j forward.
For the mechanisms we define
b = (b_1, ..., b_m)
being the bias of the mechanisms, so M_j starts in initial state b_j, and
s = (s_1, ..., s_m)
being the amount of states of every mechanism. So now we can describe our whole system like this: if we want to set all mechanisms to zero, then for every mechanism M_j
x_1*l_1j + x_2*l_2j + ... + x_n*l_nj + b_j ≡ 0 (mod s_j)
where x_i is the amount of times we have to activate L_i. If you are not familiar with the notation, a ≡ b (mod m) just means that a % m = b % m.
Each congruence can be rewritten as
x_1*l_1j + x_2*l_2j + ... + x_n*l_nj + b_j - k_j*s_j = 0
where k_j can be any natural number. So we can rewrite our whole system as an ordinary equation system, and Prolog can solve such an equation system for us.
(There are different ways to solve diophantine equation systems; see https://en.wikipedia.org/wiki/Diophantine_equation)
OK, now let's make an example: say we have two levers and three mechanisms with 4 states each. The first lever moves M1 one step forward and M3 two steps forward. The second lever moves M2 one step forward and M3 one step forward. M1 is in state 2, M2 is in state 3, M3 is in state 3. So our equation system looks like this:
x_1 + 2 - 4*k_1 = 0
x_2 + 3 - 4*k_2 = 0
2*x_1 + x_2 + 3 - 4*k_3 = 0
In Prolog we can solve this with the clpfd library:
?- [library(clpfd)].
and then solve like this:
?- X1+(-4)*K1+2 #= 0, 1*X2+(-4)*K2+3 #= 0, 2*X1+X2+(-4)*K3+3 #= 0,Vs = [X1,X2], Vs ins 0..100,label(Vs).
which gives us the solution
Vs = [2, 1]
-> so X1 = 2 and X2 = 1, which is correct. Prolog can give you more solutions as well.

Why doesn't this OpenMP parallel for loop work properly?

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
    ...
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
Where I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>
int main() {
    ...
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i]; // Euler
        v[i+1] = v[i]/(1-h); // implicit Euler
    }
    ...
    return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread is doing the work, and the others are not...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
    u[i+1] = (1+h)*u[i]; // Euler
    v[i+1] = v[i]/(1-h); // implicit Euler
}
Iteration i of the loop writes values that iteration i+1 reads. Meanwhile, iteration i+1 might be executing at the same time.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know the state at t+1 without first knowing the state at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
therefore you can parallelize your code like this
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    u[i] = pow((1+h), i)*u[0];
    v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function you can do it once per thread rather than once per iteration, like this (since the number of threads is much smaller than n).
#pragma omp parallel
{
    int nt = omp_get_num_threads();
    int t  = omp_get_thread_num();
    int s  = (t+0)*n/nt;
    int f  = (t+1)*n/nt;
    u[s] = pow((1+h), s)*u[0];
    v[s] = v[0]*pow(1.0/(1-h), s);
    for(int i=s; i<f-1; i++) {
        u[i+1] = (1+h)*u[i];
        v[i+1] = v[i]/(1-h);
    }
}
You can also write your own pow(double, int) function optimized for integer powers.
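For integer exponents, exponentiation by squaring is the usual trick; a minimal sketch (my own illustration, not code from the original answer) could look like this:
/* pow(double, int) for non-negative integer exponents, by squaring:
 * the exponent is processed bit by bit, so it needs O(log exp) multiplications. */
double powi(double base, int exp) {
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1)
            result *= base;   /* fold in the factor for the current bit */
        base *= base;         /* square the base for the next bit */
        exp >>= 1;
    }
    return result;
}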
Note that the relationship I used is not in fact 100% equivalent, because floating-point arithmetic is not associative. That's not usually a problem, but it's something one should be aware of.
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results, due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of would be computing the area under a curve by means of numerical integration.
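A hedged sketch of that idea (the integrand, interval and step count here are just placeholders picked for illustration):
#include <math.h>
#include <stdio.h>

/* Midpoint-rule integration of sin(x) on [0, pi]; the exact answer is 2.
 * Every slice is independent of the others, so a reduction over 'area'
 * is all the coordination the threads need. */
int main(void) {
    const int n = 10000000;
    const double a = 0.0, b = 3.14159265358979323846;
    const double dx = (b - a) / n;
    double area = 0.0;

    #pragma omp parallel for reduction(+:area)
    for (int i = 0; i < n; i++) {
        double x = a + (i + 0.5) * dx;   /* midpoint of the i-th slice */
        area += sin(x) * dx;
    }

    printf("area = %f\n", area);
    return 0;
}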
Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will run into problems with hidden (i.e. not correctly handled) breaches of data-access and data-value integrity in time.
A pure-[SERIAL] flow of processing is free from such dangers, as the serialised steps implicitly impose an order (simply by executing nothing but one step after another, as a sequence) in which there is no chance to "touch" the same memory location twice or more at the same time.
This peace of mind is inadvertently lost once a process goes into "just"-[CONCURRENT] or true-[PARALLEL] processing.
Suddenly there is an almost random order (in the "just"-[CONCURRENT] case) or a principally "immediate" singularity that avoids any original meaning of "order" (in the true-[PARALLEL] case) -- like a robot with 6 DoF that arrives at each and every trajectory point in a true-[PARALLEL] fashion, driving all 6 axes in parallel, not one after another in a pure-[SERIAL] manner, and not some-now-some-other-later-and-the-rest-as-it-gets in a "just"-[CONCURRENT] fashion, in which case the 3D trajectory of the robot arm would become hardly predictable and mutual collisions would be frequent on a car assembly line.
Solution:
Use either a defensive tool, called atomic operations, or take a principled approach -- design a (b)locking-free algorithm where possible, or explicitly signal and coordinate reads and writes (sure, at a cost in extra time and degraded performance) -- so as to guarantee that values do not get corrupted into inconsistent digital trash, which is what happens if protective steps (ensuring that all "old" writes get safely "through" before any "next" reads go ahead to grab a "right" value) are not coded in (as was demonstrated above).
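As a hedged illustration of the first option (a toy counter of my own, not related to the question's code), an atomic update keeps a shared read-modify-write indivisible:
#include <stdio.h>

/* A shared histogram updated from many threads: without the atomic directive,
 * two threads could read-modify-write the same bin at once and lose updates. */
int main(void) {
    int bins[8] = {0};

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        int b = i % 8;
        #pragma omp atomic
        bins[b] += 1;          /* this increment happens indivisibly */
    }

    for (int b = 0; b < 8; b++)
        printf("bin %d: %d\n", b, bins[b]);
    return 0;
}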
Epilogue:
Using a tool like OpenMP on problems where it cannot bring any advantage will result in wasted time and decreased performance (all the tool-related overheads still have to be handled, while there is literally zero net effect of parallelism in cases where the algorithm does not allow any parallelism to be enjoyed), so one finally pays way more than one finally gets.
A good place to learn about OpenMP best practices is, for example, the material from Lawrence Livermore National Laboratory (indeed very competent) and similar publications on using OpenMP.

Noise cancellation setup - combining the microphone's signals intelligently [closed]

I built a noise cancellation setup with two microphones and two different microphone preamplifiers that go to two different channels of a stereo recording.
Here is a sample
http://filestore.to/?d=U5FN2IH96K
I tried
char ergebnis[80];
sprintf(ergebnis, "%s.neu.raw", Datei);
FILE* ausgabe = fopen(ergebnis, "wb");
FILE* f = fopen(Datei, "rb");
if (f == NULL)
{
    return;
}
int i = -1;
int r1 = 0;
int r2 = 0;
int l1 = 0;
int l2 = 0;
int l = 0;
int r = 0;
int wo = 0;
int dif = 0;
while (wo != EOF)
{
    wo = getc(f);
    i++;
    if (i == 0)
    {
        r1 = (unsigned)wo;
    }
    if (i == 1)
    {
        r2 = (unsigned)wo;
        r = (r2 << 8) + r1; //r1 | r2 << 8;
    }
    if (i == 2)
    {
        l1 = (unsigned)wo;
    }
    if (i == 3)
    {
        l2 = (unsigned)wo;
        l = (l2 << 8) + l1; //l1 | l2 << 8;
        dif = r - (l * 2);
        putc((char)( (unsigned)dif & 0xff), ausgabe);
        putc((char)(((unsigned)dif >> 8) & 0xff), ausgabe);
        i = -1;
    }
}
where the magic happens in
dif = r - (l * 2);
But this does not eliminate the surrounding noise; all it does is create crackling sounds.
How could I approach this task with my setup instead? I prefer practical solutions over "read this paper only the author of the paper understands".
While we are at it, how do I normalize the final mono output to make it as loud as possible without clipping?
I don't know why you would expect this
dif = r - (l * 2);
to cancel noise, but I can tell you why it "create[s] crackling sounds". The value in dif is often going to be out of range of 16-bit audio. When this happens, your simple conversion function:
putc((char)( (unsigned)dif & 0xff), ausgabe);
putc((char)(((unsigned)dif >> 8) & 0xff), ausgabe);
will fail. Instead of a smooth curve, your audio will jump from large positive to large negative values. If that confuses you, maybe this post will help.
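One way to avoid that wrap-around is to saturate before writing the two bytes; a minimal sketch (assuming the raw file is signed 16-bit little-endian, which would also mean sign-extending the samples when you read them, e.g. r = (short)((r2 << 8) | r1)):
#include <stdio.h>

/* Clamp a processed sample to the signed 16-bit range before writing it out
 * as two little-endian bytes, so overflow saturates instead of wrapping. */
void put_sample_clamped(int dif, FILE *out) {
    if (dif >  32767) dif =  32767;
    if (dif < -32768) dif = -32768;
    putc(dif & 0xff, out);
    putc((dif >> 8) & 0xff, out);
}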
Even if you solve that problem, a few things aren't clear, not the least of which is that for active noise cancelling to work you usually assume that one mike provides a reference of the noise and the other provides signal + noise. Which is which in this case? Did you just place two mikes next to each other and hope to hear some sound source with less ambient noise after some simple arithmetic? That won't work, since they are both hearing different combinations of signal and noise (different not just in amplitude, but in time as well). So you need to answer: 1. which mike is the source of signal and which is the source of noise? 2. what kind of noise are you trying to cancel? 3. what distinguishes the mikes in their ability to hear signal and noise? 4. etc.
Update: I'm still not clear on your setup, but here's something that might help:
You might have a setup where your signal is strong in one mike and weak in the other, and a noise is applied to both mikes. In all likelihood, there will be signal leaking into both mikes. Nevertheless, we will assume
l = noise1
r = signal + noise2
Note that I have not assumed the same noise values for l and r; this reflects the reality that the two mikes will pick up different noise values due to time delays and other factors. However, it is often the case (and may or may not be the case in your setup) that noise1 and noise2 are correlated at low frequencies. Thus, if we have a low-pass filter, lp, we can achieve some noise reduction in the low frequencies as follows:
out = r - lp(l) = signal + noise2 - lp(noise1)
This, of course, assumes that the noise level at l and r is the same, which it may or may not be, depending on your setup. You may want to leave a manual parameter for this purpose, to be tuned by hand at the end:
out = r - g*lp(l)
where g is your tuning parameter, close to 1. I believe that in some high-end noise reduction systems g is constantly tuned automatically.
Selecting a cutoff frequency for your lp filter is all that remains. An approximation you could use is that the highest frequency you can cancel has a wavelength equal to 4 times the distance between the mikes (i.e. the mike spacing is a quarter of the wavelength). Of course, I'm REALLY waving my arms with that, because it depends a lot on where the sound is coming from, how directional your mikes are and so on, but it's a starting point.
Sample calculation for mikes that are 3 inches apart:
Speed of sound = 13,397 inches / sec
desired wavelength = 4*3 inches = 12 inches
frequency = 13,397 / 12 = 1116 Hz
So your filter should have a cutoff frequency of 1116 Hz if the mikes are 3 inches apart.
Expect this setup to cancel a significant amount of your signal below the cutoff frequency as well, if there is bleed.
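A minimal sketch of that approach in C (the one-pole low-pass is my own choice, not something prescribed above; fs, g and the 1116 Hz cutoff are assumptions you would tune for your own setup):
#include <math.h>

/* out = r - g * lp(l): subtract the low-passed noise reference from the
 * signal+noise channel, using a simple one-pole RC-style smoother as lp. */
typedef struct { double a, state; } OnePoleLP;

void lp_init(OnePoleLP *f, double cutoff_hz, double fs) {
    f->a = 1.0 - exp(-6.283185307179586 * cutoff_hz / fs);  /* 1 - e^(-2*pi*fc/fs) */
    f->state = 0.0;
}

double lp_process(OnePoleLP *f, double x) {
    f->state += f->a * (x - f->state);   /* move a fraction 'a' toward the input */
    return f->state;
}

/* r = signal + noise2, l = noise1, g ~ 1.0 is the manual tuning gain. */
double cancel(OnePoleLP *f, double r, double l, double g) {
    return r - g * lp_process(f, l);
}
With fs = 44100 and cutoff_hz = 1116 this roughly matches the 3-inch example above. For the normalisation question: find the maximum absolute output over the whole file in a first pass, then scale every sample by 32767.0 / max before writing, so the loudest sample just reaches full scale without clipping.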

Suggestions on optimizing a Z-buffer implementation?

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.
In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this...)
I ran gprof on my executable, and I end up with the following interesting lines:
  %    cumulative    self               self     total
 time    seconds    seconds    calls   ms/call  ms/call  name
43.51       9.50       9.50                              vSwap
34.86      17.11       7.61   179944      0.04     0.04  grInterpolateHLine
13.99      20.17       3.06                              grClearDepthBuffer
<snip>
 0.76      21.78       0.17      624      0.27    12.46  grScanlineFill
The function vSwap is my double-buffer swapping function, and it also performs vsyncing, so it makes sense to me that the test program spends much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.
My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsync overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}
I really don't see how I can improve that, especially considering that vSetPixel is a macro.
My entire stock of ideas for optimization has been whittled down to precisely one:
Use an integer/fixed-point depth buffer.
The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
You should have a look at the source code to something like Quake - considering what it could achieve on a Pentium, 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.
It's hard to tell what higher-order optimizations can be done without seeing the rest of the code. I have a couple of minor observations, though.
There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    for(; x1 <= x2; x1 ++, z += zstep, offset++) {
        if(z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}
Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
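A hedged C sketch of that span idea (my own interpretation, not the answerer's actual code; it reuses grDepthBuffer, vSetPixel and VIDEO_WIDTH from the question): conservatively test the whole span against its farthest z value; if everything passes, fill without per-pixel branches, otherwise fall back to the per-pixel loop.
void grInterpolateHLineSpan(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + y * VIDEO_WIDTH;
    int count = x2 - x1 + 1;
    if (count <= 0) return;

    float zEnd = z + zstep * (count - 1);
    float zMax = (z > zEnd) ? z : zEnd;     /* z is linear, so its extreme is at an endpoint */

    int allPass = 1;
    for (int i = 0; i < count; i++) {
        if (zMax >= grDepthBuffer[offset + i]) { allPass = 0; break; }
    }

    if (allPass) {
        /* Whole span is visible: plain fills, no per-pixel depth branches. */
        for (int i = 0; i < count; i++) {
            vSetPixel(x1 + i, y, colour);
            grDepthBuffer[offset + i] = z + zstep * i;
        }
    } else {
        /* Partially occluded: per-pixel test as before. */
        for (int i = 0; i < count; i++, z += zstep) {
            if (z < grDepthBuffer[offset + i]) {
                vSetPixel(x1 + i, y, colour);
                grDepthBuffer[offset + i] = z;
            }
        }
    }
}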
Also, I'm not sure what you mean by older computer, but if your target computer is multi-core then you can split the work among multiple cores. You can do this for the buffer-clearing function as well. It can help quite a bit.
I ended up solving this by replacing the Z-buffer with the painter's algorithm. I used SSE to write a Z-buffer implementation that created a bitmask with the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.
Thank you, everyone, for your input.

OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation

I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran 90 with OpenMP):
do n=1,lm
  do m=1,jm
    do l=1,im
      sum_u = 0
      sum_v = 0
      sum_t = 0
      do k=1,lm
        !$omp parallel do reduction (+:sum_u,sum_v,sum_t)
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u = sum_u + u_p(i,j,k) * exp_smoother
            sum_v = sum_v + v_p(i,j,k) * exp_smoother
            sum_t = sum_t + t_p(i,j,k) * exp_smoother
            sum_u_pert(l,m,n) = sum_u
            sum_v_pert(l,m,n) = sum_v
            sum_t_pert(l,m,n) = sum_t
          end do
        end do
      end do
    end do
  end do
end do
Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.
Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.
Thanks!
-Jen
The obvious place to put the omp directive is on the outermost loop.
For every (l,m,n), you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independent of the others, so you can parallelize over the outermost loop. So, for instance, the simplest thing is
!$omp parallel do private(n,m,l,i,j,k,exp_smoother) shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p), default(none)
do n=1,lm
  do m=1,jm
    do l=1,im
      do k=1,lm
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
            sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
            sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
          end do
        end do
      end do
    end do
  end do
end do
gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contention or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread and combine them at the end; but depending on the problem size, having the equivalent of an extra im x jm x lm sized array might not be desirable.
It seems like there's a lot of structure in this problem you ought to be able to exploit to speed up even the serial case, but it's easier to say that than to find it; playing around with pen and paper, nothing comes to mind in a few minutes, but someone cleverer may spot something.
What you have is a convolution. This can be done with a Fast Fourier Transform in N log2(N) time. Your algorithm is N^2. If you use an FFT, one core will probably be enough!
