Multithreading a series of equations - C

I have a long series of equations that looks something like this except with about 113 t's:
t1 = L1;
t2 = L2 + 5;
t3 = t2 + t1;
t4 = L3;
...
t113 = t3 + t4;
return t113;
Where L's are input arguments.
It takes a really long time to calculate t113, so I'm trying to split the work across several threads in an attempt to make it quicker. The problem is that I'm not sure how to do this. I tried drawing the t's out as a tree by hand on paper so I could analyse the dependencies, but it grew too large and unwieldy partway through.
Are there other ways to make the calculations faster? Thanks.
EDIT: I'm using an 8-core DSP with SYS/BIOS. According to my predecessor, these inverse and forward kinematic equations will take the most time to process. My predecessor also chose this 8-core DSP as the implementation hardware with that in mind, so I'm assuming I should write the code in a way that takes advantage of all 8 cores.

With values that depend on other values, you're going to have a very tough time allocating the work to different threads. It's also likely that you'll end up with one thread waiting on another, and firing off new threads is probably more expensive than calculating only 113 values.
Are you sure it's the calculation of t113 that is taking a long time, or is it something else that takes the time?
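To make the dependency problem concrete, here is a rough sketch (my own illustration, not the poster's code) of what a split would have to look like: the assignments grouped into "levels" so that everything within a level is independent. OpenMP sections are used only for brevity; on an 8-core SYS/BIOS DSP you would use its own tasking mechanism, and the synchronization at each level boundary is exactly the waiting described above.

#include <stdio.h>

/* Hypothetical example: the L and t names follow the question. */
double compute(double L1, double L2, double L3)
{
    double t1, t2, t3, t4;

    /* level 1: these depend only on the inputs, so they could run on
       different cores */
    #pragma omp parallel sections
    {
        #pragma omp section
        t1 = L1;
        #pragma omp section
        t2 = L2 + 5;
        #pragma omp section
        t4 = L3;
    }

    /* level 2: needs the level-1 results, so it must wait for them */
    t3 = t2 + t1;

    /* ... further levels up to t113 ... */
    return t3 + t4;
}

int main(void)
{
    printf("%f\n", compute(1.0, 2.0, 3.0));  /* example inputs */
    return 0;
}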

I'm assuming that the tasks are time intensive and more than just L2 + L3 or something. If not, then the overhead of the threading is going to vastly exceed any minimal gains from it.
If this were Java, then I'd use an Executors.newCachedThreadPool(), which starts a new thread whenever needed, and then allow the jobs themselves to submit jobs to the thread pool and wait for the responses. That's a bit of a strange pattern, but it would work.
For example:
private final ExecutorService threadPool = Executors.newCachedThreadPool();
...
public class T3 implements Callable<Double> {
    public Double call() throws Exception {
        Future<Double> t2 = threadPool.submit(new T2());
        Future<Double> t1 = threadPool.submit(new T1());
        return t2.get() + t1.get();
    }
}
Then the final task would be:
Future<Double> t3 = threadPool.submit(new T3());
// this throws some exceptions that need to be caught
double result = t3.get();
threadPool.shutdown();
Then the thread pool would just take care of the tasks and do as much parallelization as it can. Note that if the output of the T1 task were used in multiple places, this would not work.
If this is another language, maybe a similar pattern can be used depending on the thread libraries available.

If all the assignments are as simple as the ones you show, a reasonable compiler will reduce them just fine. For the parts you show,
return L1 + L2 + L3 + 5; should be all the work it ends up doing.
Perhaps this could be done in two threads (on two CPUs) like:
T1: L1 + L2
T2: L3 + 5
Parent thread: Add the two results.
But with only 113 additions -- if that's what they are -- and given that modern computers are very good at adding, this probably won't be "faster".
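Purely to illustrate the two-thread split described above (a sketch of my own, not a recommendation; with work this small the thread creation dominates, which is exactly the point being made):

#include <pthread.h>
#include <stdio.h>

/* one pair of operands and the partial result computed by a thread */
typedef struct { double a, b, result; } pair_sum;

static void *add_pair(void *arg)
{
    pair_sum *p = arg;
    p->result = p->a + p->b;
    return NULL;
}

int main(void)
{
    double L1 = 1.0, L2 = 2.0, L3 = 3.0;   /* example inputs */
    pair_sum s1 = { L1, L2, 0.0 };         /* T1: L1 + L2 */
    pair_sum s2 = { L3, 5.0, 0.0 };        /* T2: L3 + 5  */
    pthread_t t1, t2;

    pthread_create(&t1, NULL, add_pair, &s1);
    pthread_create(&t2, NULL, add_pair, &s2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* parent thread: add the two results */
    printf("%f\n", s1.result + s2.result);
    return 0;
}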

Your simple example would automatically multithread (and optimise the solution path) using Excel's multi-threaded calculation.
But you don't give enough specifics to tell whether this would be a sensible approach for your real-world application.


Implementing PI control for Teensy Atmega 32u4

I am implementing PID control using the standard libraries of the Teensy Atmega 32u4. My control variable is a PWM signal. My process variable is the current angular position of a DC motor that is interfaced with a 10 kΩ potentiometer, with code that reads the position from the ADC input on a scale of 0 to 270 degrees. The set point is a laser-cut joystick whose handle is also attached to a 10 kΩ potentiometer and reads angular position in the same manner as the process variable.
My question is how to implement the integral portion of the control scheme. The integral term is given by:
Error = Set Point – Process Variable
Integral = Integral + Error
Control Variable = (Kp * Error) + (Ki * Integral)
But I am unsure as to how to calculate the integral portion. Do I need to account for the amount of time that has passed between samples, or do I just accumulate the error and initialize the integral term to zero, so that it is truly discretized? And since I'm using C, can the integral term just be a global variable?
Am I on the right track?
Since the sample time (the interval at which the PID is calculated) is always the same, it does not matter whether you divide the integral term by the sample time: a constant sample time just gets absorbed into the Ki constant. It is still better to divide the integral term by the sample time, so that the PID behaves consistently if you later change the sample time, but it is not compulsory.
Here is the PID_Calc function I wrote for my drone robotics competition, in Python. Ignore the "[index]"; that is an array I used to make my code generic.
def pid_calculator(self, index):
    # calculate the current residual error; the drone has reached the desired point when this becomes zero
    self.Current_error[index] = self.setpoint[index] - self.drone_position[index]
    # accumulate the values required for the I and D terms; loop_time is the sample time (dt)
    self.errors_sum[index] = self.errors_sum[index] + self.Current_error[index] * self.loop_time
    self.errDiff = (self.Current_error[index] - self.previous_error[index]) / self.loop_time
    # calculate the individual controller terms - P, I, D
    self.Proportional_term = self.Kp[index] * self.Current_error[index]
    self.Derivative_term = self.Kd[index] * self.errDiff
    self.Integral_term = self.Ki[index] * self.errors_sum[index]
    # compute the PID output by adding all the individual terms
    self.Computed_pid = self.Proportional_term + self.Derivative_term + self.Integral_term
    # store the current error as the previous error for the next iteration
    self.previous_error[index] = self.Current_error[index]
    # return the computed PID output
    return self.Computed_pid
Here is the link to my whole PID script on GitHub.
See if that helps you.
Press the up button if you like the answer, and do star my GitHub repository if you like the script.
Thank you.
To add to the previous answer, also consider the case of integral windup in your code: there should be some mechanism to reset the integral term if windup occurs. Also select the largest available datatype for the integral (sum) term, typically long long, to avoid integral overflow.
If you are selecting a sufficiently high sampling frequency, the division can be avoided to reduce the computation involved. However, if you want to experiment with the sampling time, keep the sampling times in multiples of powers of two, so that the division can be accomplished through shift operations. For example, if the sampling times selected are 100 ms, 50 ms, 25 ms and 12.5 ms, the dividing factors can be 1, 1<<1, 1<<2, 1<<3.
It is convenient to keep all the associated variables of a PID controller in a single struct, and then pass that struct to the functions operating on the PID. This way the code is modular, and many PID loops can run simultaneously on the microcontroller, using the same code and just different instances of the struct. This approach is especially useful in large robotics projects, where you have many loops to control using a single CPU.
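As a sketch of that idea (my own illustration, not code from the question; the fixed sample period DT and the clamp-based anti-windup are assumptions), a struct-based PID in C might look like this:

#define DT 0.01f   /* 10 ms sample period, as an example */

typedef struct {
    float kp, ki, kd;        /* gains */
    float integral;          /* accumulated error * DT */
    float prev_error;        /* error from the previous sample */
    float out_min, out_max;  /* output limits, also used for anti-windup */
} pid_state;

float pid_update(pid_state *pid, float setpoint, float measurement)
{
    float error = setpoint - measurement;

    pid->integral += error * DT;

    float output = pid->kp * error
                 + pid->ki * pid->integral
                 + pid->kd * (error - pid->prev_error) / DT;
    pid->prev_error = error;

    /* crude anti-windup: clamp the output and undo the integration
       that pushed it past the limit */
    if (output > pid->out_max) { pid->integral -= error * DT; output = pid->out_max; }
    else if (output < pid->out_min) { pid->integral -= error * DT; output = pid->out_min; }

    return output;
}

Each control loop then owns one pid_state instance and calls pid_update() once every DT, for example from a timer interrupt.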

Difference between using different population sizes and different crossover methods

I have a couple of general questions on genetic algorithms. In the selection step, where you pick chromosomes from the population, is there an ideal number of chromosomes to pick? What difference does it make if I pick, say, 10 chromosomes instead of 20? Does it have any effect on the final result? At the mutation stage, I've learnt there are different ways to mutate - single-point crossover, two-point crossover, uniform crossover and arithmetic crossover. When should I choose one over the other? I know these sound very basic, but I couldn't find answers anywhere, so I thought I should ask on Stack Overflow.
Thanks
It seems to me that your terminology and concepts are a little bit mixed up. Let me clarify.
First of all, there are many names people use for the members of the population: genotype, genome, chromosome, individual, solution... I will use solution for now since it is, in my opinion, the most general term: it is what we are eventually evolving. Also, I'm not a biologist, so I don't know whether genotype, genome and chromosome somehow differ and, if they do, what the difference is...
Population
Genetic Algorithms are population-based evolutionary algorithms. The algorithm (usually) maintains a fixed-size population of solutions to the problem it is solving.
Genetic operators
There are two principal genetic operators - crossover and mutation. The goal of crossover is to take two (or more in some cases) solutions and combine them to create a solution that has some properties of both, optimally the best of both. The goal of mutation is to create new genetic material that was not previously present in the population by doing a small random change.
The choice of the particular operators, i.e. whether to use a single-point or multi-point crossover etc., is totally problem-dependent. For example, if your solutions are composed of logical blocks of bits that work together within each block, it might not be a good idea to use uniform crossover because it will destroy these blocks. In such a case a single- or multi-point crossover is a better choice, and the best choice is probably to restrict the crossover points to the boundaries of the blocks only.
You have to try what works best for your problem. Also, you can always use all of them, i.e. by randomly choosing which crossover operator is going to be used each time the crossover is about to be performed. Similarly for mutation.
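For concreteness, here is a minimal sketch of a single-point crossover on fixed-length bit strings (written in C; my own illustration, not tied to any particular GA library). Restricting the crossover point to multiples of a block size gives the block-boundary variant mentioned above.

#include <stdlib.h>

/* Single-point crossover on fixed-length bit strings stored one bit per
   char. The crossover point is chosen uniformly at random. */
void single_point_crossover(const char *parent_a, const char *parent_b,
                            char *child_a, char *child_b, int len)
{
    int point = rand() % len;   /* e.g. (rand() % n_blocks) * block_size for block boundaries */
    for (int i = 0; i < len; ++i) {
        child_a[i] = (i < point) ? parent_a[i] : parent_b[i];
        child_b[i] = (i < point) ? parent_b[i] : parent_a[i];
    }
}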
Modes of operation
Now to your first question about the number of selected solutions. Genetic Algorithms can run in two basic modes - generational mode and steady-state mode.
Generational mode
In generational mode, the whole population is replaced in every generation (iteration) of the algorithm. A simple python-like pseudo-code for a generational-mode GA could look like this:
P = [...]  # initial population
while not stopping_condition():
    Pc = []  # empty population of children
    while len(Pc) < len(P):
        a = select(P)  # select a solution from P using some selection strategy
        b = select(P)
        if rand() < crossover_probability:
            a, b = crossover(a, b)
        if rand() < mutation_probability:
            a = mutation(a)
        if rand() < mutation_probability:
            b = mutation(b)
        Pc.append(a)
        Pc.append(b)
    P = Pc  # replace the population with the population of children
Evaluation of the solutions was omitted.
Steady-state mode
In steady-state mode, the population persists and only a few solutions are replaced in each iteration. Again, a simple steady-state GA could look like this:
P = [...]  # initial population
while not stopping_condition():
    a = select(P)  # select a solution from P using some selection strategy
    b = select(P)
    if rand() < crossover_probability:
        a, b = crossover(a, b)
    if rand() < mutation_probability:
        a = mutation(a)
    if rand() < mutation_probability:
        b = mutation(b)
    replace(P, a)  # put a child back into P based on some replacement strategy
    replace(P, b)
Evaluation of the solutions was omitted.
So, the number of selected solutions depends on how you want your algorithm to operate.

Spark: speed up multiple join operations

Suppose I have a rule like this:
p(v3,v4) :- t1(k1,v1), t2(k1,v2), t3(v1,v3), t4(v2,v4).
The task is to join t1, t2, t3, and t4 together to produce a relation p.
Suppose t1, t2, t3, and t4 already have the same partitioner on their keys.
A common strategy is to join the relations one by one, but that will force at least 3 shuffle/repartition operations. Details are below (suppose I have 10 partitions).
1. join: x = t1.join(t2)
2. repartition: x = x.map(lambda (k1, (v1,v2)): (v1,v2)).partitionBy(10)
3. join: x = x.join(t3)
4. repartition: x = x.map(lambda (v1, (v2,v3)): (v2,v3)).partitionBy(10)
5. join: x = x.join(t4)
6. repartition: x = x.map(lambda (v2, (v3,v4)): (v3,v4)).partitionBy(10)
Because t1 to t4 all have the same partitioner, and I repartition the intermediate result after every join, the join operations themselves will not involve any shuffle.
However, the intermediate result (i.e. the variable x) is huge in my practical code, and 3 shuffle operations are still too many for me.
My questions are:
Is there anything wrong with my strategy to evaluate this rule? Is there any better, more efficient solution?
My understanding of the shuffle operation is that, for each partition, Spark does the repartitioning independently and writes the repartitioned results for each partition to disk (the so-called shuffle write). Then, for each partition, Spark reads the new repartitioned results back from disk (the so-called shuffle read). If my understanding is correct, each shuffle/repartition always costs disk reads and writes. That seems wasteful if I can guarantee my memory is large enough to store all the data, just as described in http://www.trongkhoanguyen.com/2015/04/understand-shuffle-component-in-spark.html. Is there any workaround to disable this kind of shuffle write and read? I think my program's performance bottleneck is the shuffle I/O overhead.
Thank you.

Rendering an image using multiple threads

I have a ray-tracing algorithm which works with only 1 thread, and I am trying to make it work with any number of threads.
My question is: how can I divide this task among the threads?
At first my instructor told me to just divide the width of the image. For example, if I have an 8x8 image and I want 2 threads to do the task, let thread 1 render columns 0 to 3 (of course all the way down vertically) and thread 2 render columns 4 to 7.
I found this approach to work perfectly when both the image width and the number of threads are powers of 2, but I have no idea how to deal with an odd number of threads, or any number of threads that can't divide the width without a remainder.
My approach to this problem was to let the threads render the image in an alternating pattern. For example, with an 8x8 image and, let's say, 3 threads:
thread 1 renders columns 0, 3, 6
thread 2 renders columns 1, 4, 7
thread 3 renders columns 2, 5
Sorry that I can't provide all my code, since there are more than 5 files with a few hundred lines of code in each one.
Here are the for loops that loop through the horizontal range; the vertical loop is inside these, but I am not going to provide it here.
My instructor's suggestion:
for( int px=(threadNum*(width/nthreads)); px < ((threadNum+1)*(width/nthreads)); ++px )
threadNum is the current thread that I am on (meaning thread 0,1,2 and so on)
width is the width of the image
nthreads is the overall number of threads.
My solution to this problem:
for( int px= threadNum; px< width; px+=nthreads )
I know my question is not very clear, and I'm sorry that I can't provide the whole code here, but basically all I am asking is: what is the best way to divide the rendering of the image among a given number of threads (which can be any positive number)? Also, I want the threads to render the image by columns, meaning I can't touch the part of the code which handles vertical rendering.
Thank you, and sorry for the chaotic question.
First, let me tell you that under the assumption that the rendering of each pixel is independent of the other pixels, your task is what in the HPC field is called an "embarrassingly parallel problem": a problem that can be efficiently divided between any number of threads (until each thread has a single "unit of work"), without any communication between the processes (which is very good).
That said, it doesn't mean that any parallelization scheme is as good as any other. For your specific problem, I would say that the two main factors to keep in mind are load balancing and cache efficiency.
Load balancing means that you should divide the work among the threads in a way that gives each thread roughly the same amount of work: this way you prevent one or more threads from waiting for that one last thread that still has to finish its last job.
E.g.
You have 5 threads and you split your image into 5 big chunks (let's say 5 horizontal strips, but they could be vertical and it wouldn't change the point). Since the problem is embarrassingly parallel, you expect a 5x speedup, and instead you get a meager 1.2x.
The reason might be that most of the computationally expensive details are in the lower part of the image (I know nothing about rendering, but I assume that a reflective object might take far more time to render than flat empty space), because the scene is composed of a set of polished metal marbles on the floor of an otherwise empty frame.
In this scenario, only one thread (the one with the bottom 1/5 of the image) does all the work anyway, while the other 4 remain idle after finishing their brief tasks.
As you can imagine, this isn't a good parallelization. Keeping only load balancing in mind, the best scheme would be to assign interleaved pixels to each core, under the (very reasonable) assumption that the complexity of the image averages out across threads (true for natural images; it might yield surprises in very limited scenarios).
With this solution, the work is evenly distributed among the threads (statistically), and the worst-case scenario is N-1 threads waiting for a single thread to compute a single pixel (which you wouldn't notice, performance-wise).
To do that you need to cycle over all the pixels, forgetting about lines, in this way (pseudo-code, not tested):
for (i = thread_num; i < width * height; i += n_threads)
The second factor, cache efficiency, deals with the way computers are designed: specifically, the fact that they have many layers of cache to speed up computation and keep the CPUs from starving (remaining idle while waiting for data). Accessing data the "right way" can speed up computation considerably.
It's a very complex topic, but in your case a rule of thumb might be "feeding each thread the right amount of memory will improve the computation" (emphasis on "right amount" intended...).
It means that, even though handing each thread interleaved pixels probably gives perfect balancing, it's also probably the worst possible memory access pattern you could devise, and you should pass "bigger chunks" to them instead, because this keeps the CPUs busy (note: memory alignment also comes heavily into play: if your image has padding after each line to keep rows multiples of, say, 32 bytes, like some image formats do, you should take that into consideration!).
Without expanding an already verbose answer to alarming sizes, this is what I would do (I'm assuming the memory of the image is consecutive, without padding between lines!):
1. Create a program that hands chunks of N consecutive pixels (use a preprocessor constant or a command-line argument for N, so you can change it!) to each of the M threads in turn, like this (see the sketch just after this list):
1111111122222222333333334444444411111111
2. Do some profiling for various values of N, stepping from 1 to, let's say, 2048 by powers of two (good values to test might be: 1 to get a baseline, then 32, 64, 128, 256, 512, 1024, 2048).
3. Find out where the sweet spot is between perfect load balancing (N=1) and best caching (N <= the biggest cache line in your system).
4a. Try the program on more than one system, and keep the smallest value of N that gives the best test results across the machines, in order to make your code run fast everywhere (as the caching details vary between systems).
4b. If you really, really want to squeeze every cycle out of every system you install your code on, forget step 4a and write code that automatically finds the best value of N by rendering a small test image before tackling the appointed task :)
5. Fool around with SIMD instructions (just kidding... sort of :) )
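As an illustration of step 1, here is a minimal sketch of the chunk-interleaved index loop (my own example; CHUNK_SIZE, n_threads, thread_num and render_pixel() are assumed names, and the image is assumed to be a flat array of width * height pixels with no padding):

/* each thread walks the image in chunks of CHUNK_SIZE consecutive pixels,
   skipping the chunks owned by the other threads (pattern 1111 2222 ...) */
for (int chunk = thread_num * CHUNK_SIZE;
     chunk < width * height;
     chunk += n_threads * CHUNK_SIZE)
{
    int end = chunk + CHUNK_SIZE;
    if (end > width * height)
        end = width * height;
    for (int i = chunk; i < end; ++i)
        render_pixel(i % width, i / width);   /* column, row */
}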
A bit theoretical (and overly long...), but still I hope it helps!
An alternating division of the columns will probably lead to suboptimal cache usage. The threads should operate on a larger contiguous range of data. By the way, if your image is stored row-wise, it would also be better to distribute the rows instead of the columns.
This is one way to divide the data equally with any number of threads:
#define min(x,y) (x<y?x:y)
/*...*/
int q = width / nthreads;
int r = width % nthreads;
int w = q + (threadNum < r);
int start = threadNum*q + min(threadNum,r);
for( int px = start; px < start + w; px++ )
/*...*/
The remainder r is distributed over the first r threads. This is important when calculating the start index for a thread.
For the 8x8 image with 3 threads, this would lead to:
thread 0 renders columns 0-2
thread 1 renders columns 3-5
thread 2 renders columns 6-7

Which chunk size will yield the best performance using master-worker with MPI?

I'm using MPI to parallelize a program that is trying to solve the metric TSP problem. I have P processors and N cities to pass through.
Each worker asks the master for work and receives a chunk - a range of permutations that it should check - and calculates the minimum among them. I am optimizing this by pruning bad routes in advance.
There are (N-1)! routes in total to calculate. Each worker gets a chunk described by two numbers: the first route it has to check and the last one. In addition, the master sends it the most recent best result known, so the worker can easily prune bad routes in advance using a lower bound on their remainder.
Each time a worker finds a result that is better than the global best, it asynchronously sends it to all the other workers and to the master.
I'm not looking for a better solution; I'm just trying to determine which chunk size is best.
The best chunk size I've found so far is (n!)/(n/2)!, but it doesn't yield very good results.
Please help me understand which chunk size is best here. I'm trying to balance the amount of computation against the communication.
Thanks.
This depends heavily on factors beyond your control: MPI implementation, total load on the machine, etc. However, I'd hazard a guess that it also heavily depends on how many worker processes there are. On that note, understand that MPI spawns processes, not threads.
Ultimately, as is often the case with most optimization questions, the answer is simply "test a lot of different settings and see which one is best". You may want to do this manually, or write a tester app that implements some sort of heuristic (e.g. a genetic algorithm).
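For what it's worth, here is a minimal sketch of such a tester (my own simplified example: it uses a static round-robin distribution rather than a real master-worker protocol, and do_chunk() is just a placeholder for the actual route evaluation), so that different chunk sizes can be compared by wall-clock time from the command line:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static double do_chunk(long start, long len)   /* placeholder work */
{
    double best = 1e300;
    for (long i = 0; i < len; ++i) {
        double cost = (double)(start + i);     /* stand-in for a route cost */
        if (cost < best) best = cost;
    }
    return best;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long total = 10000000L;                          /* stand-in for (N-1)! */
    long chunk = (argc > 1) ? atol(argv[1]) : 1000;  /* chunk size under test */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    double local_best = 1e300;
    /* static round-robin distribution of chunks, just for timing purposes */
    for (long start = rank * chunk; start < total; start += (long)size * chunk) {
        long len = (start + chunk <= total) ? chunk : total - start;
        double b = do_chunk(start, len);
        if (b < local_best) local_best = b;
    }

    double global_best = 0.0;
    MPI_Reduce(&local_best, &global_best, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("chunk=%ld  best=%g  time=%.3fs\n", chunk, global_best, t1 - t0);

    MPI_Finalize();
    return 0;
}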
