Scheduling algorithm CPU time - C

I am trying to make a simulator for a scheduling algorithm in C using round robin and FCFS.
I just have a few questions, as I have tried to look this up and read about the kernel commands but I am still confused :( The program is being written over PuTTY (Linux), and it has a list of processes with a time clock that execute or take up CPU time.
How do we make a process take up CPU time? Do we call some sys() function (I don't know which one), or are we meant to malloc a process when I read it into my program from a text file? I know I may sound stupid, but please explain.
What do you suggest is the best data structure for storing a process (time created, process ID, memory size, job time), for example (0,2,70,8)?
When a process finishes its job time, how do we terminate it so that it frees the CPU and the next process at a later clock time can use it?
How do you implement the clock time? Is there any built-in function, or should I just use a for loop?
I hope I am not asking too many questions; whoever can get back to me, I would really appreciate it.
Regards

If you're building a simulator you should NOT actually be waiting that amount of time; you should "schedule" by updating counters and recording, say, that process P1 has run for 750 ms total so far, scheduled 3 times for 250 ms, 250 ms, 250 ms, and so on. Trying to run a scheduling simulation in real time in user space is bound to give you odd results, since your simulator process itself needs to be scheduled as well.
For instance, if you want to simulate FCFS, you implement a simple "process" queue and give each process a time slice (you can use the default kernel timeslice or your own; it doesn't really matter), and each of these processes has some total execution time it needs to finish, which you can base your calculations on. For example, P1 is a process that requires 3.12 seconds of CPU time to finish (I don't think memory simulation is needed, since we're doing scheduling and not caching or anything else). You run the algorithm like you normally would, just adding numbers: you "run" P1, add time to its counter, and check whether it's done; if it is, compute the difference and so on, and you can keep a global time to track how long it has run in wall-clock time. Then simply put P1 at the end of the queue and "schedule" the next process.
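To make that concrete, here is a minimal sketch of the counter-based approach (the process data, the 250 ms slice, and the struct layout are made-up illustrations, not something from your assignment):

#include <stdio.h>

#define NPROC 3
#define TIME_SLICE 250 /* ms per quantum; an arbitrary choice */

typedef struct {
    int id;
    int cpu_need;    /* total CPU time this "process" requires, in ms */
    int cpu_used;    /* CPU time consumed so far, in ms */
    int finish_time; /* simulated wall-clock time at which it finished */
} proc_t;

int main(void) {
    proc_t p[NPROC] = { {1, 3120, 0, -1}, {2, 500, 0, -1}, {3, 900, 0, -1} };
    int clock = 0, remaining = NPROC;

    while (remaining > 0) {                    /* cycle over the "queue" */
        for (int i = 0; i < NPROC; i++) {
            if (p[i].cpu_used >= p[i].cpu_need)
                continue;                      /* already finished */
            int slice = p[i].cpu_need - p[i].cpu_used;
            if (slice > TIME_SLICE)
                slice = TIME_SLICE;
            p[i].cpu_used += slice;            /* "run" it: just add numbers */
            clock += slice;                    /* advance the simulated clock */
            if (p[i].cpu_used >= p[i].cpu_need) {
                p[i].finish_time = clock;
                remaining--;
            }
        }
    }
    for (int i = 0; i < NPROC; i++)
        printf("P%d finished at t = %d ms\n", p[i].id, p[i].finish_time);
    return 0;
}

The same structure answers the original questions: the "data structure" is just a struct (plus an array or linked-list queue of them), a process "takes up CPU time" by having its counter incremented, it "terminates" when cpu_used reaches cpu_need, and the clock is an ordinary integer you advance yourself.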
Now, if you want to measure scheduling performance, that is a completely different matter; it usually involves running workload benchmarks that launch many processes on the system and checking overall performance metrics for each.

Related

Measuring time of the process/thread in user mode and in kernel on behalf of the process/thread

Imagine a process/thread is running from point A to point B.
I can get how much time the code execution took by taking two gettimeofday() readings and calculating the difference (wall-clock time). However, it may happen that during the 'route' from A to B the CPU switched to other processes, to drivers, to the kernel, and to other things it must do to keep the system running.
Is it possible to somehow identify how much of the A-to-B time went into actual process/thread execution, plus the kernel time spent on its behalf?
The goal of this exercise is to identify how much time the CPU was NOT executing the process/thread or its system calls because it was executing something else.
I am using C.
Searching the man pages backwards from time(1), I found this:
You can use getrusage():
getrusage() returns resource usage measures.
It can be used for the process itself (RUSAGE_SELF), its child processes (RUSAGE_CHILDREN), or the calling thread (RUSAGE_THREAD).
Amongst other values, it will give you the user CPU time used and the system CPU time used.
Please see man 2 getrusage for the details (or use an online copy such as https://linux.die.net/man/2/getrusage).
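For instance, a minimal sketch of measuring the A-to-B section of your own process could look like this (the busy loop just stands in for your real code between A and B):

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double to_sec(struct timeval tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    struct rusage a, b;
    getrusage(RUSAGE_SELF, &a);            /* point A */

    volatile double x = 0.0;               /* stand-in for the real A-to-B work */
    for (long i = 0; i < 50000000L; i++)
        x += i * 0.5;

    getrusage(RUSAGE_SELF, &b);            /* point B */
    printf("user CPU time:   %.6f s\n", to_sec(b.ru_utime) - to_sec(a.ru_utime));
    printf("system CPU time: %.6f s\n", to_sec(b.ru_stime) - to_sec(a.ru_stime));
    return 0;
}

Comparing the sum of those two values with a gettimeofday() (wall-clock) difference over the same span tells you how long the CPU was doing something other than running your process or its system calls.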

Timings differ while measuring a CUDA kernel

I am trying to measure the time taken by a CUDA kernel function. I measure both the CPU and the GPU timings of it, but I'm getting a huge difference between the two.
When I profile it using the NVIDIA profiler, the kernel takes around 6 ms, which is what I want. But when I used gettimeofday() around the kernel call to get the CPU timing, the measurement was 15 ms. There are no memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code:
gettimeofday(&start, NULL);
cudaEventRecord(startGPU);
Kernel<<<abc, xyz, 0, stream>>>();
cudaDeviceSynchronize();
cudaEventRecord(stopGPU);
printf("Elapsed GPU time = ");
gettimeofday(&stop, NULL);
printf("Elapsed CPU time = ");
Results I'm getting for the above code:
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
It is weird, because only the kernel execution line is present. The kernel parameters are, however, pointers. Is the extra time being taken by memory copies? But I do not find any memcpys anywhere in the profile either. Any leads would be appreciated.
Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread but rather the total elapsed (wall-clock) time, which potentially includes time spent by other processes/threads while yours was not necessarily even running. I have to admit that, even in light of all that, the CPU time you report is still much larger relative to the GPU time than I would normally expect. But I'm not sure that what you posted really is your entire code; in fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyway, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the interval between the start and end events will span the start of kernel execution, the kernel itself, the CPU waiting for kernel execution to finish, and whatever time it then takes to record the second event and have it reach the GPU just so you can ask the GPU at what time it received it. Think of events as markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>();
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
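A short sketch of that last step (assuming startGPU and stopGPU were created earlier with cudaEventCreate()):

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, startGPU, stopGPU); /* elapsed time in milliseconds */
printf("Kernel time: %.3f ms\n", kernelMs);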
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::steady_clock, or std::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
After debugging the same issue, I found that CUDA usually takes extra time before the first kernel launch, as mentioned in this forum thread: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel included 6 ms of cudaMalloc and 14 ms of cudaLaunch, which was the reason for the extra delay. Subsequent kernels, however, ran normally. cudaLaunch usually takes on the order of microseconds, so if anything goes beyond that, it definitely needs some attention.
NOTE: If you are running CUDA kernels in a while(1) loop, the allocation must be done outside the loop (i.e., only once). Otherwise you will end up with delays just like this.
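A rough sketch of that structure (the buffer name, size, and the kernel launch are placeholders, not code from the question):

float *d_buf = NULL;
cudaMalloc((void **)&d_buf, N * sizeof(float)); /* allocate once, before the loop */

while (keep_running) {
    /* Kernel<<<abc, xyz, 0, stream>>>(d_buf);     launch here, reusing d_buf */
    cudaDeviceSynchronize();
}

cudaFree(d_buf); /* free once, after the loop (or on the real exit path) */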

How often does a processor check if time on a timer is up to execute a program?

For example, if we program a computer to check and update some variables every 5 minutes, does that mean the computer actually checks whether the condition is met (whether the 5 minutes are up so it can execute the program) on every tick? In my view, that would mean the more conditionals or timers (or both) there are, the heavier the load on the processor, even though the processor is only checking whether the time is up or the condition is matched.
My reasoning here is that the processor can't really put a task away, forget about it for 5 minutes, and then just remember it and execute the program. It has to keep track of time (counting seconds or ticks or whatever), keep track of the timers that are currently active, and check whether the time on every timer is up or not.
That makes every timer a conditional statement, right?
So the main question is: am I correct in all of those statements, or is the reality a bit different, and if so, how different?
Thank you.
I am assuming here that you have a basic understanding of how processes are managed by the CPU.
Most programming languages implement some form of wait() function that causes the CPU to stop executing instructions from that thread until it is interrupted, allowing it to work on other tasks in the meantime. Waiting for an interrupt uses very little of the system's resources and is much more efficient than the polling method you were describing.
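As a concrete illustration in C (the 5-minute interval is just the example from the question), a sleeping loop hands the CPU back to the scheduler instead of repeatedly checking the clock:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (;;) {
        /* Blocking wait: the kernel takes this process off the run queue and
           wakes it via a timer interrupt roughly 5 minutes later. Unlike a
           busy-wait loop that polls the time, no CPU is consumed while asleep. */
        sleep(5 * 60);

        /* check and update the variables here */
        printf("5 minutes elapsed, doing the periodic work\n");
    }
}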
This is a pretty basic explanation, but if you want to learn more, look up preemptive multitasking.

How to write a file after a given run time with MPI?

I have a simulation code written in C, parallelized with MPI, running on a Linux cluster that kills jobs after 12 h of wall time. Jobs that last longer than 12 h must then be restarted from a file written by the program.
My code currently writes these 'restart files' every N steps of the simulation. It is important that each node is at the same simulation step before writing a restart file.
In my case these files are big (> 1 GB per process), so I cannot write them as often as I would need to (it takes too much time and space).
Also, the execution time of one simulation step depends on what is going on within the simulation, so it is quite difficult to predict how many steps the simulation will have completed within the 12 h. As a result, I cannot simply decide to write the restart file after the number of steps I expect to be done just before the 12 h of run time.
Consequently, when my job is killed, the last restart file may have been written several hours earlier, which forces me to redo a substantial part of the last 12 h of execution.
Therefore, I am looking for a way to write a restart file as a function of the elapsed run time. I have thought of using MPI_Wtime(); however, at a given runtime, say 11:50:00, all processes won't necessarily be at the same simulation step, which is not good. Is there a simple solution to this problem?
Once your processes hit the 11:50:00 mark (or some other suitable deadline), have them do an MPI_Allreduce on the number of iterations completed, using MPI_MAX. Then they can each catch up to exactly that number of iterations and wait for everyone else to do the same with a simple MPI_Barrier. They can then start writing the restart file.
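A rough sketch of that scheme (do_one_step(), write_restart_file(), and the deadline value are placeholders for your real simulation code):

#include <mpi.h>
#include <stdio.h>

#define DEADLINE_SECONDS 5.0 /* e.g. 11*3600 + 50*60 for an 11:50:00 deadline */

static void do_one_step(void)          { /* one simulation step */ }
static void write_restart_file(long n) { printf("restart written at step %ld\n", n); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double t_start = MPI_Wtime();
    long my_steps = 0;

    /* normal simulation loop until the local deadline is reached */
    while (MPI_Wtime() - t_start < DEADLINE_SECONDS) {
        do_one_step();
        my_steps++;
    }

    /* agree on a common step count: the maximum any rank has reached */
    long target = 0;
    MPI_Allreduce(&my_steps, &target, 1, MPI_LONG, MPI_MAX, MPI_COMM_WORLD);

    /* slower ranks catch up to exactly that step */
    while (my_steps < target) {
        do_one_step();
        my_steps++;
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone is now at the same step */
    write_restart_file(target);    /* safe to write the restart file   */

    MPI_Finalize();
    return 0;
}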

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
As an example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
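To make the distinction concrete (a C sketch of the same idea, independent of whatever benchmarking module you use): a process-CPU-time clock only advances while your process is actually on the CPU, whereas the wall clock advances regardless.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double to_sec(struct timespec ts) {
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    struct timespec wall0, wall1, cpu0, cpu1;
    clock_gettime(CLOCK_MONOTONIC, &wall0);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);

    volatile double x = 0.0;                 /* some actual CPU work */
    for (long i = 0; i < 100000000L; i++)
        x += i;
    sleep(1);                                /* waiting: wall time, ~no CPU time */

    clock_gettime(CLOCK_MONOTONIC, &wall1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

    printf("elapsed real time: %.3f s\n", to_sec(wall1) - to_sec(wall0));
    printf("CPU time:          %.3f s\n", to_sec(cpu1) - to_sec(cpu0));
    return 0;
}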
A multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively working.
You don't want to totally discount wall-clock time, though. Time spent waiting without another thread ready to use the CPU cycles may make one piece of code less desirable than another. One piece of code may take somewhat more CPU time but use multithreading to outperform the other code in the real world. It depends on the requirements and specifics. My point is: use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time when your code is executing is not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).
