I have a simulation code written in C, parallelized with MPI, running on a Linux cluster that kills jobs after 12h of wall time. Jobs that need longer than 12h must then be restarted from a file written by the program.
My code currently writes these 'restart files' every N steps of the simulation. It is important that every process is at the same simulation step before the restart file is written.
In my case these files are big (> 1GB/process), so I cannot write them as often as I would need to (it takes too much time and space).
Also, the execution time of one simulation step depends on what is going on within the simulation, so it is quite difficult to predict how many steps the simulation will have done within the 12h. I therefore cannot simply write the restart file after the number of steps I expect to be completed just before the 12h of run time.
As a result, when my job is killed, the last restart file may have been written several hours earlier, which forces me to redo a substantial part of the last 12h of execution.
Therefore, I am looking for a way to write a restart file as a function of the elapsed run time. I have thought of using MPI_Wtime(), but for a given runtime, say 11:50:00, the processes won't necessarily all be at the same simulation step... which is not good. Is there a simple solution to this problem?
Once your processes hit the 11:50:00 mark (or some other suitable deadline), have them AllReduce the number of iterations completed using MPI_MAX. Then they can catch up to exactly that number of iterations, and wait for everyone else to do the same with a simple Barrier. They can then start writing the restart file.
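A minimal sketch of that scheme, assuming hypothetical do_one_step() and write_restart_file() routines and a hard-coded 11:50:00 deadline (it also assumes a rank can run a few catch-up steps on its own):

#include <mpi.h>

#define DEADLINE_SECONDS (11.0 * 3600.0 + 50.0 * 60.0)   /* 11:50:00 */

/* Hypothetical helpers standing in for the real simulation code. */
void do_one_step(void);
void write_restart_file(long step);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();
    long step = 0;

    /* Run until this rank's own clock passes the deadline. */
    while (MPI_Wtime() - t0 < DEADLINE_SECONDS) {
        do_one_step();
        step++;
    }

    /* Ranks may stop at slightly different steps: agree on the
       furthest step any rank reached... */
    long max_step = 0;
    MPI_Allreduce(&step, &max_step, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);

    /* ...catch up to it... */
    while (step < max_step) {
        do_one_step();
        step++;
    }

    /* ...and write the restart file once everyone is level. */
    MPI_Barrier(MPI_COMM_WORLD);
    write_restart_file(step);

    MPI_Finalize();
    return 0;
}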
I have a C program in which I try to measure the time taken by a small piece of code. On the first execution it gives me 30 ms; when I close the exe and run it again it gives me 1 ms, and each run after that gives a value different from the previous one. If I turn the PC off and on again, the first execution gives 30 ms and the later executions give about 1 ms again.
How can I get the same time every run? I free all the used memory and I run another program to overwrite the memory, but the problem is not solved until I reboot the PC.
Any help?
start_time = clock();
Encryption();
end_time = clock();
cpu_time_used_total_enc += (double)(end_time - start_time) / CLOCKS_PER_SEC;
This problem is called "warmup": when you want to performance-test some code, you first need to run the code several times (say, 10 times) to warm everything up. Then you run it 100'000 times, measure how long that takes, and divide by 100'000 to get the average. A single measurement of the runtime is useless unless the runtime is at least one minute.
The reason for warmup problems is that modern OSs and languages do all kinds of tricks to make your code execute faster. For example, the call to Encryption() might actually invoke a function in a shared library.
Such libraries are loaded lazily, i.e. a library is loaded the first time your code actually calls one of its functions. Once loaded, the OS keeps it in a cache, since chances are someone will need it again.
That's why the first few runs of an application have a completely different runtime than the next 10'000 runs.
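A minimal sketch of that measuring pattern, reusing the Encryption() routine from the question:

#include <stdio.h>
#include <time.h>

void Encryption(void);   /* the routine being measured (from the question) */

int main(void)
{
    const int warmup_runs = 10;
    const long measured_runs = 100000;

    /* Warmup: let lazy loading, caches, etc. settle before measuring. */
    for (int i = 0; i < warmup_runs; i++)
        Encryption();

    /* Measure many runs and report the average per call. */
    clock_t start = clock();
    for (long i = 0; i < measured_runs; i++)
        Encryption();
    clock_t end = clock();

    double total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("average: %.9f seconds per call\n", total / measured_runs);
    return 0;
}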
As part of my academic project I have to execute a C program.
I want to get the execution time of the program. For that I would like to put all other processes in Linux to sleep for a few seconds. Is there any method for doing that?
(I have tried using the time command in Linux, but it is not working properly: it shows a different execution time each time I run the same program. So I am computing the execution time as the difference between start time and end time.)
About the best way I can think of is to drop to single-user mode, which you get with
# init 1
on pretty much any distribution. This will also stop X, you'll be on a raw console. Handling interrupts from stray mouse movement is likely to be one of the reasons for whatever variability you're seeing, so that's a good thing.
When you want your full system back, init 3 is probably the one, that or init 5.
The usual way to do this is to try to quiesce the machine as much as possible, then take several measurements and average them. It's advisable to discard the first reading, as that's likely to involve population of caches.
It is impossible to get the exact execution time of a process on a system where the scheduler switches from one process to another.
Intel processors include a register that counts clock cycles, but even with that it is impossible to measure the time exactly.
There is a book you can find as a PDF on Google, "Computer Systems: A Programmer's Perspective"; a whole chapter of it is dedicated to time measurement.
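If you want to experiment with that cycle-counting register anyway, here is a minimal sketch using the __rdtsc() intrinsic (GCC/Clang on x86, from <x86intrin.h>), with a hypothetical work_to_measure() standing in for your program's code; note the count still includes everything else the machine did in between:

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

void work_to_measure(void);   /* hypothetical: the code being timed */

int main(void)
{
    unsigned long long start = __rdtsc();
    work_to_measure();
    unsigned long long end = __rdtsc();

    /* Cycle count, not wall time: interrupts, other processes scheduled
       in between, and frequency scaling are all still included. */
    printf("elapsed: %llu cycles\n", end - start);
    return 0;
}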
Use the time command. The sum user + sys will give you the time your program used the CPU directly plus the time the system used the CPU on behalf of your program. I think that is what you want to know.
There will always be some difference in execution time no matter how many processes you shut down; polling, IO, and background daemons all affect execution priority.
The academic approach would be to run a sizeable sample and take statistics; you might also want to take a look at sar to log the background load, so you know which readings to invalidate.
Try executing your application with nice -n 20. It may help to make the other processes quieter.
See the nice man page for details.
I am trying to make a simulator for a scheduling algorithm in C using round-robin and FCFS.
I just have a few questions; I have tried to look this up and read about the kernel calls, but I'm still confused :( The program is being run on Linux (through PuTTY), and it should handle a list of processes, with a clock, that execute or take up CPU time.
How do we make a process take up CPU time? Do we call some sys() function (I don't know which one), or are we meant to malloc a process when we read it from the text file? I know I may sound stupid, but please explain.
What do you suggest is the best data structure for storing a process (creation time, process id, memory size, job time), e.g. (0, 2, 70, 8)?
When a process finishes its job time, how do we terminate it so that it frees the CPU and the next process, arriving at a later clock time, can use the CPU?
How do you implement the clock time? Is there a built-in function, or should I just use a for loop?
I hope I am not asking too many questions, but whoever can get back to me, I would really appreciate it.
Regards
If you're building a simulator you should NOT actually be waiting that amount of time; you should "schedule" by updating counters, recording, say, that process p1 has run for 750ms total so far, scheduled 3 times for 250ms, 250ms, 250ms, etc. Trying to run a scheduling simulation in real time in user space is bound to give you odd results, since your simulator process itself needs to be scheduled as well.
For instance, if you want to simulate FCFS, you implement a simple "process" queue and give each entry a time slice (you can use the default kernel timeslice or your own, it doesn't really matter). Each of these processes has some total execution time it needs in order to finish, and you can do your calculations based on that. For example, P1 is a process that requires 3.12 seconds of CPU time to finish (I don't think memory needs to be simulated, since we're doing scheduling and not caching or anything else). You run the algorithm as you normally would, but you just add numbers: you "run" P1 by adding time to its counter, then check whether it's done; if it is, record the difference, and so on. You can keep a global counter to track how long everything has taken in simulated wall-clock time. Then you simply put P1 at the end of the queue and "schedule" the next process.
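A minimal sketch of that counter-based bookkeeping, with an invented struct for the (creation time, pid, memory size, job time) records from the question:

#include <stdio.h>

#define NPROC 3
#define QUANTUM 2   /* simulated time slice, in ticks */

/* One record per simulated process: (creation time, pid, memory, job time). */
struct proc {
    int created;      /* clock tick at which the process arrives   */
    int pid;
    int mem;          /* carried along but not used for scheduling */
    int job_time;     /* total CPU ticks the process needs         */
    int run_time;     /* CPU ticks it has received so far          */
    int finished_at;  /* -1 while still running                    */
};

int main(void)
{
    struct proc p[NPROC] = {
        {0, 2, 70, 8, 0, -1},
        {1, 3, 50, 4, 0, -1},
        {2, 4, 90, 6, 0, -1},
    };

    int sim_clock = 0;    /* simulated wall-clock time, no real waiting */
    int remaining = NPROC;

    while (remaining > 0) {
        int ran = 0;
        for (int i = 0; i < NPROC; i++) {
            if (p[i].finished_at >= 0 || p[i].created > sim_clock)
                continue;                 /* already done, or not arrived yet */

            /* "Run" the process: just add time to its counters. */
            int slice = p[i].job_time - p[i].run_time;
            if (slice > QUANTUM)
                slice = QUANTUM;          /* round-robin: cap at the quantum */
            p[i].run_time += slice;
            sim_clock += slice;
            ran = 1;

            if (p[i].run_time >= p[i].job_time) {
                p[i].finished_at = sim_clock;
                remaining--;
                printf("pid %d finished at tick %d (turnaround %d)\n",
                       p[i].pid, sim_clock, sim_clock - p[i].created);
            }
        }
        if (!ran)
            sim_clock++;   /* CPU idle: advance the clock until someone arrives */
    }
    return 0;
}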
Now if you want to measure scheduling performance that's completely different and this usually involves running workload benchmarks to run many processes on the system and check overall performance metrics for each.
I have an algorithm that takes 7 days to run to completion (and a few more algorithms like it).
Problem: in order to successfully run the program, I need a continuous power supply. If, out of bad luck, there is a power loss in the middle, I need to restart it from the beginning.
So I would like to ask for a way to make my program execute in phases (say each phase generates results A, B, C, ...) so that, in case of a power loss, I can somehow use these intermediate results and resume the run from that point.
Problem 2: how do I prevent a file from being reopened on every loop iteration? (fopen was placed in a loop that runs nearly a million times; this was needed because the file is modified on each iteration.)
You can separate it into several source files and use make.
When each result phase is complete, branch off to a new universe. If the power fails in the new universe, destroy it and travel back in time to the point at which you branched. Repeat until all phases are finished, and then merge your results into the original universe via a transcendental wormhole.
Well, a couple of options, I guess:
You split your algorithm along sensible lines, with a defined output from each phase that can be the input to the next phase. Then configure your algorithm as a workflow (ideally soft-configured through some declaration file).
You add logic to your algorithm by which it knows what it has successfully completed (committed). Then, on failure, you can restart the algorithm; it bins all uncommitted data and restarts from the last commit point.
Note that both these options may draw out your 7-day run time even further!
So, to improve the overall runtime, you could also separate your algorithm so that it has "worker" components that can work on "jobs" in parallel. This usually means pulling out some "dumb" but intensive logic (such as a computation) that can be parameterised. You then have the option of running your algorithm on a grid/space/cloud/whatever, which at least gives you a way to reduce the run time. It doesn't even need to be a compute grid: just use queues (IBM MQ Series has a C interface), have listeners on other boxes watching your jobs queue and processing jobs, and persist the results. You can still phase the algorithm as discussed above, too.
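A minimal sketch of the commit-point idea, under made-up names: each completed phase's result is appended to a results file, and the last committed phase number is recorded in a small state file, so a restart can skip what is already done.

#include <stdio.h>

#define NUM_PHASES 7

/* Hypothetical: compute the result of one phase of the algorithm. */
double run_phase(int phase);

int main(void)
{
    int last_committed = -1;

    /* On startup, read how far a previous run got (if it got anywhere). */
    FILE *state = fopen("checkpoint.state", "r");
    if (state) {
        if (fscanf(state, "%d", &last_committed) != 1)
            last_committed = -1;
        fclose(state);
    }

    for (int phase = last_committed + 1; phase < NUM_PHASES; phase++) {
        double result = run_phase(phase);

        /* Persist the phase result first... */
        FILE *out = fopen("results.dat", "a");
        if (!out) { perror("results.dat"); return 1; }
        fprintf(out, "%d %.17g\n", phase, result);
        fclose(out);

        /* ...then commit: after a power loss we restart from this phase + 1. */
        state = fopen("checkpoint.state", "w");
        if (!state) { perror("checkpoint.state"); return 1; }
        fprintf(state, "%d\n", phase);
        fclose(state);
    }
    return 0;
}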
Problem 2: Opening the file on each iteration of the loop because it's changed
I may not be best qualified to answer this, but doing fopen (and fclose) on each iteration certainly seems wasteful and slow. To answer properly, or to let someone more qualified answer, we would need to know more about your data.
For instance:
Is it text or binary?
Are you processing records or a stream of text? That is, is it a file of records or a stream of data? (you aren't cracking genes are you? :-)
I ask because, judging by your comment "because it's changed each iteration", you might be better off using a random-access file. My guess is that you are re-opening the file so you can fseek to a point you may already have passed (in your stream of data) and make a change there. However, if you open the file in binary mode, you can move anywhere in it using fseek and fsetpos; that is, you can seek backwards.
Additionally, if your data is record-based or otherwise organised, you could create an index for it. You could then use fsetpos to set the file pointer to the record you're interested in and traverse from there, saving the time spent finding the area of data to change. You could even persist the index in an accompanying index file.
Note that you can write plain text to a binary file. Perhaps worth investigating?
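A minimal sketch of the keep-it-open, seek-and-rewrite approach, assuming fixed-size records and made-up names; the file is opened once outside the loop instead of a million times inside it:

#include <stdio.h>

/* Hypothetical fixed-size record layout. */
struct record {
    long   id;
    double value;
};

int main(void)
{
    /* Open once, for reading and writing, in binary: "r+b" keeps the contents. */
    FILE *f = fopen("data.bin", "r+b");
    if (!f) { perror("data.bin"); return 1; }

    for (long i = 0; i < 1000000; i++) {
        long target = i % 1000;   /* which record to update (made up) */
        struct record r;

        /* Seek to the record, read it, modify it, seek back, rewrite it. */
        if (fseek(f, target * (long)sizeof r, SEEK_SET) != 0) break;
        if (fread(&r, sizeof r, 1, f) != 1) break;

        r.value += 1.0;           /* whatever the per-iteration change is */

        if (fseek(f, target * (long)sizeof r, SEEK_SET) != 0) break;
        if (fwrite(&r, sizeof r, 1, f) != 1) break;
    }

    fclose(f);
    return 0;
}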
Sounds like a classical batch-processing problem to me.
You will need to define checkpoints in your application and store the intermediate data until a checkpoint is reached.
Checkpoints could be the row number in a database, or the position inside a file.
Your processing might take longer than it does now, but it will be more reliable.
In general you should think about where the bottleneck in your algorithm is.
For problem 2, use two files; it might be that your application will be days faster if you call fopen a million times less often...
I've written two relatively small programs in C. They communicate with each other using textual data. Program A generates some problems from given input; B evaluates them and creates input for another iteration of A.
Here's a bash script that I currently use:
for i in {1..1000}
do
./A data > data2;
./B data2 > data;
done
The problem is that, since what A and B do is not very time-consuming, most of the time is spent (I suppose) starting the programs up. When I measure the time the script takes, I get:
$ time ./bash.sh
real 0m10.304s
user 0m4.010s
sys 0m0.113s
So my main question is: is there any way to communicate data between those two programs faster? I don't want to integrate them into one application, because I'm trying to build a toolset of independent, easily communicating tools (as suggested in "The Art of Unix Programming", from which I'm learning how to write reusable software).
PS. The data and data2 files contain sets of data that these applications need whole, all at once (so communicating e.g. one line of data at a time is impossible).
Thanks for any suggestions.
cheers,
kajman
Can you create named pipes?
mkfifo data1
mkfifo data2
./A data1 > data2 &
./B data2 > data1
If your application is reading and writing in a loop, this could work :)
If you used a pipe to transfer the stdout of program A to the stdin of program B, you would remove the need to write the file "data2" on each loop iteration.
./A data1 | ./B > data1.tmp && mv data1.tmp data1
(writing B's output to a temporary file first avoids truncating data1 before A has read it)
Program B would need to have the capability of using input from stdin rather than a specified file.
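A minimal sketch of that capability, assuming B currently takes the input file name as its first argument: read from stdin when no file name (or "-") is given.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* Use stdin unless a file name other than "-" was given. */
    FILE *in = stdin;
    if (argc > 1 && strcmp(argv[1], "-") != 0) {
        in = fopen(argv[1], "r");
        if (!in) { perror(argv[1]); return 1; }
    }

    /* Placeholder processing loop: copy input to output line by line. */
    char line[4096];
    while (fgets(line, sizeof line, in))
        fputs(line, stdout);

    if (in != stdin)
        fclose(in);
    return 0;
}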
If you want to make a program run faster, you need to understand what is making the program run slowly. The field of computer science dedicated to measuring the performance of a running program is called profiling.
Once you discover which internal portion of your program is running slow, you can generally speed it up. How you go about speeding up that item depends heavily on what "the slow part" is doing and how it is "being done".
Several people have recommended pipes for moving the data directly from the output of one program into the input of another program. Assuming you rewrite your tools to handle input and output in a piped manner, this might improve performance. Again, it depends on what you are doing and how you are doing it.
For example, if your tool just fixes windows style end-of-lines into unix style end-of-lines, the program might read in one line, waiting for it to be available, check the end-of-line and write out the line with the desired end-of-line. Or the tool might read in all of the data, do a replacement call on each "wrong" end-of-line in memory, and then write out all of the data. With the first solution, piping speeds things up. With the second solution piping doesn't speed up anything.
The reason it is truly so hard to answer such a question is that the fix you need really depends on the code you have, the problem you are trying to solve, and the means by which you are solving it now. In the end, there isn't always a 100% guarantee that the code can be sped up; however, virtually every piece of code has opportunities to be sped up. Use profiling to speed up the parts that are slow, instead of wasting your time working on a part of your program that is only called once and represents 0.001% of the program's runtime.
Remember, if you speed up something that is 0.001% of your program's runtime by 50%, you have only sped up your entire program by 0.0005%. Use profiling to find the block of code that's taking up 90% of your runtime and concentrate on that.
I do have to wonder why, if A and B depend on each other to run, you want them to be part of an independent toolset.
One solution is a compromise between the two:
Create a library that contains A.
Create a library that contains B.
Create a program that spawns two threads, thread 1 running A and thread 2 running B.
Create a semaphore that tells A to run and another that tells B to run.
After the function that calls A in thread 1, increment B's semaphore.
After the function that calls B in thread 2, increment A's semaphore.
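A minimal sketch of that thread-and-semaphore ping-pong using POSIX threads and semaphores, with hypothetical run_A()/run_B() functions standing in for the two libraries:

#include <pthread.h>
#include <semaphore.h>

#define ITERATIONS 1000

/* Hypothetical entry points exported by the A and B libraries. */
void run_A(void);
void run_B(void);

static sem_t sem_a;   /* posted when A may run */
static sem_t sem_b;   /* posted when B may run */

static void *thread_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        sem_wait(&sem_a);   /* wait for our turn */
        run_A();            /* generate problems */
        sem_post(&sem_b);   /* hand over to B    */
    }
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        sem_wait(&sem_b);
        run_B();            /* evaluate, produce next input for A */
        sem_post(&sem_a);
    }
    return NULL;
}

int main(void)
{
    sem_init(&sem_a, 0, 1);   /* A goes first */
    sem_init(&sem_b, 0, 0);

    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_a, NULL);
    pthread_create(&tb, NULL, thread_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    sem_destroy(&sem_a);
    sem_destroy(&sem_b);
    return 0;
}

The data exchange itself still has to happen through whatever the library functions read and write (a shared buffer, or the existing files).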
Another possibility is to use file locking in your programs:
Make both A and B execute in infinite loops (or for however many iterations you're processing data).
Add code at the beginning of the loop in A and B that attempts to lock both files (if it can't, it sleeps and tries again, so that nothing happens until it holds the locks).
Add code at the end of each loop that unlocks the files and sleeps for longer than the retry sleep in the previous step.
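A minimal sketch of what one side of that file-locking loop could look like, using flock() (fcntl() locks would be the portable alternative) and a made-up process_files() step:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>   /* flock() */

/* Hypothetical: do one iteration of work on the two data files. */
void process_files(void);

int main(void)
{
    int fd1 = open("data", O_RDWR);
    int fd2 = open("data2", O_RDWR);
    if (fd1 < 0 || fd2 < 0) { perror("open"); return 1; }

    for (;;) {
        /* Try to grab both locks; back off and retry if the other program has them. */
        if (flock(fd1, LOCK_EX | LOCK_NB) != 0) { usleep(1000); continue; }
        if (flock(fd2, LOCK_EX | LOCK_NB) != 0) {
            flock(fd1, LOCK_UN);
            usleep(1000);
            continue;
        }

        process_files();

        /* Release, then sleep longer than the retry sleep so the other side gets a turn. */
        flock(fd1, LOCK_UN);
        flock(fd2, LOCK_UN);
        usleep(5000);
    }
}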
Either of these solves the problem of having the overhead of launching the programs between runs.
It's almost certainly not application startup which is the bottleneck. Linux will end up caching large portions of your programs, which means that launching will progressively get faster (to a point) the more times you start your program.
You need to look elsewhere for your bottleneck.