Approximate runtime formula for a program - C

I have a C program which for the following number of inputs executes in the following amount of time:
23 0.001s
100 0.001s
I tried to find a formula for this but was unsuccessful. As far as I can see, the time sometimes doubles and sometimes it doesn't, which is why I couldn't find a formula.
Any thoughts?
NOTES
1) I am measuring CPU time (user+sys).
2) My program uses quicksort.
3) The asymptotic run-time complexity of my program is O(N log N).

Plotting this, it looks very much like you are hitting a "cache cliff": your data is big enough that it no longer fits entirely in one CPU cache level, so it must be shuttled between cheaper and more expensive levels.
The differences at the lower levels are likely due to precision; if you increase the number of loop iterations inside the timed block, this might smooth out.
Generally, when hitting a cache cliff the performance looks something like O*Q,
where O is the average runtime, N log(N) in your case,
and Q is a step function that increases as you pass the allocated size of each cache,
so Q = 1 below the L1 size, 10 above the L1 size, 100 above the L2 size, etc. (numbers only as an example).
In fact, some people verify their cache sizes by plotting an O(1)-per-access function and looking at the amount of memory used at the performance cliffs:
___
_____-----
L1 | L2 | L3
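
For illustration, here is a minimal sketch of that kind of O(1)-per-access probe (my own code, not from the question; it assumes a 64-byte cache line and uses clock_gettime, so it is Linux/POSIX-flavoured). The ns-per-access figure should step up each time the working set outgrows a cache level:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a buffer of `bytes` bytes repeatedly, touching one element per
   (assumed 64-byte) cache line, and return the average ns per access. */
static double probe(size_t bytes)
{
    size_t n = bytes / sizeof(size_t);   /* number of size_t elements   */
    size_t step = 64 / sizeof(size_t);   /* one access per 64-byte line */
    size_t *buf = malloc(n * sizeof(size_t));
    volatile size_t sink = 0;
    struct timespec t0, t1;
    const size_t reps = 100;

    if (!buf)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        buf[i] = i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i += step)
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    free(buf);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)(reps * (n / step));
}

int main(void)
{
    /* Sweep working-set sizes from 4 KiB to 64 MiB; the ns/access curve
       should step up each time a cache level is exceeded. */
    for (size_t kb = 4; kb <= 64 * 1024; kb *= 2)
        printf("%8zu KiB : %6.2f ns/access\n", kb, probe(kb * 1024));
    return 0;
}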

I always use this to get a precise run time:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

clock_t startm, stopm;
#define START if ((startm = clock()) == -1) { printf("Error calling clock"); exit(1); }
#define STOP  if ((stopm = clock()) == -1) { printf("Error calling clock"); exit(1); }
#define PRINTTIME printf("%6.3f seconds used by the processor.", ((double)stopm - startm) / CLOCKS_PER_SEC);

int main() {
    int i, x;
    START;
    scanf("%d", &x);
    for (i = 0; i < 10000; i++) {
        printf("%d\n", i);
    }
    STOP;
    PRINTTIME;
    return 0;
}

Related

Need an algorithm to detect large spikes in oscillating data

I am parsing through data on an SD card, one large chunk at a time, on a microcontroller. It is accelerometer data, so it is oscillating constantly. At certain points huge oscillations occur (shown in the graph). I need an algorithm that can detect these large oscillations and, more specifically, determine the range of data that contains the spike.
I have some example data:
This is the overall graph; there is only one spike of interest, the first one.
Here it is zoomed in a bit.
As you can see, it is a large oscillation that produces the spike.
So any algorithm that can scan through a data set and determine the portion of data that contains a spike relative to some threshold would be great. This data set is about 50,000 samples, each sample is 32 bits long. I have ample RAM to be able to hold this much data.
Thanks!
For the following signal:
If you take the absolute value of the differential between two consecutive samples, you get:
That is not quite good enough to distinguish the spike unambiguously from the minor "unsustained" disturbances, but it becomes much clearer if you then take a simple moving sum (a leaky integrator) of the abs-differentials. Here a window width of 4 diff-samples was used:
The moving average introduces a lag or phase shift, which in cases where the data is stored and processing is not real-time can easily be compensated for by subtracting half the window width from the timing:
For real-time processing, if the lag is critical, a more sophisticated IIR filter might be appropriate. In any case, a clear threshold can be selected from this data.
In code for the above data set:
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

static int32_t dataset[] = { 0,0,0,0,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,3,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,-10,-15,-5,20,25,50,-10,-20,-30,0,30,5,-5,
                             0,0,5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,
                             0,0,0,0,0,0,0,0,0,1,0,0,0,0,6,0,0,0,0,0,0,0 } ;

#define DATA_LEN (sizeof(dataset)/sizeof(*dataset))
#define WINDOW_WIDTH 4
#define THRESHOLD 15

int main()
{
    uint32_t window[WINDOW_WIDTH] = {0} ;
    int window_index = 0 ;
    int window_sum = 0 ;
    bool spike = false ;

    for( int s = 1; s < (int)DATA_LEN; s++ )
    {
        // abs-differential of consecutive samples
        uint32_t diff = abs( dataset[s] - dataset[s-1] ) ;

        // moving sum over the last WINDOW_WIDTH differentials
        window_sum -= window[window_index] ;
        window[window_index] = diff ;
        window_index++ ;
        window_index %= WINDOW_WIDTH ;
        window_sum += diff ;

        if( !spike && window_sum >= THRESHOLD )
        {
            spike = true ;
            printf( "Spike START # %d\n", s - WINDOW_WIDTH / 2 ) ;
        }
        else if( spike && window_sum < THRESHOLD )
        {
            spike = false ;
            printf( "Spike END # %d\n", s - WINDOW_WIDTH / 2 ) ;
        }
    }

    return 0;
}
The output is:
Spike START # 66
Spike END # 82
https://onlinegdb.com/ryEw69jJH
Comparing the original data with the detection threshold gives:
For your real data, you will need to select a suitable window width and threshold to get the desired result, both of which will depend on the bandwidth and amplitude of the disturbances you wish to detect.
Also you may need to guard against arithmetic overflow if your samples are of sufficient magnitude. They need to be less than 2³² / window-width to guarantee no overflow in the integrator. Alternatively you could use floating-point or uint64_t for the window type, or add code to deal with saturation.
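
If overflow is a concern, a minimal sketch of the wider-accumulator variant (my illustration, not part of the answer above; WINDOW_WIDTH as in the earlier listing):

#include <stdint.h>
#include <stdlib.h>

#define WINDOW_WIDTH 4

/* Same leaky-integrator update as above, but with 64-bit window slots and
   sum so that even full-scale int32_t differences cannot overflow. */
static uint64_t window_update(uint64_t window[WINDOW_WIDTH], int *index,
                              uint64_t *sum, int32_t prev, int32_t cur)
{
    /* do the subtraction in 64 bits: int32_t - int32_t can itself overflow */
    uint64_t diff = (uint64_t)llabs((int64_t)cur - (int64_t)prev);

    *sum -= window[*index];       /* drop the oldest differential   */
    window[*index] = diff;        /* store the newest one           */
    *index = (*index + 1) % WINDOW_WIDTH;
    *sum += diff;
    return *sum;                  /* compare this against THRESHOLD */
}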
You could look at statistical analysis: calculate the standard deviation over the data set and then check when your data goes out of bounds.
You can do this in two ways: either you use a running average over a fixed (relatively small) number of samples, or you take the average over the whole data set. As I see multiple spikes in your set, I would suggest the first. This way you can possibly stop processing (and later continue) every time you find a spike.
For your purpose you do not really need to calculate the standard deviation sigma itself. You could actually work with the square of sigma, which gives a slight performance optimization since you do not have to calculate the square root.
Some pseudo code:
// The data set.
int x[N];
// The number of samples in your mean and std calculation (M <= N).
int M;
// Sigma at index i over the previous M samples j = i-M+1 .. i.
sigma_i         = sqrt( sum( pow(x[j] - mean(x,M), 2) ) / M );
// Or just the square of sigma (no sqrt needed).
sigma_squared_i = sum( pow(x[j] - mean(x,M), 2) ) / M;
The disadvantage of this method is that you need to set a threshold for the value of sigma at which you trigger. However, it is fairly safe to say that setting the threshold at 4 or 5 times your average sigma will give you a usable system.
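
As a rough, self-contained illustration of this idea (my sketch, not the answerer's code), the whole comparison can be kept in the squared domain with running sums, so no square root is ever taken:

#include <stdio.h>
#include <stdint.h>

#define M 8    /* window length for the mean/variance      */
#define K 4    /* trigger at K standard deviations from it */

/* Flag x[i] as a spike when (x[i] - mean)^2 > K^2 * variance, where mean and
 * variance are taken over the previous M samples.  Running sums keep each
 * step O(1).  Assumes sample magnitudes are modest (e.g. raw accelerometer
 * counts), so the 64-bit intermediates cannot overflow. */
static void detect(const int32_t *x, int n)
{
    int64_t sum = 0, sum_sq = 0;

    for (int i = 0; i < n; i++) {
        if (i >= M) {
            /* window currently holds x[i-M] .. x[i-1]                          */
            /* (x[i] - sum/M)^2 > K^2 * (sum_sq/M - (sum/M)^2), scaled by M^2:  */
            int64_t dev = (int64_t)M * x[i] - sum;
            int64_t var = (int64_t)M * sum_sq - sum * sum;
            if (dev * dev > (int64_t)K * K * var)
                printf("spike at sample %d (value %d)\n", i, (int)x[i]);

            /* slide the window: drop the sample that falls out of it */
            sum    -= x[i - M];
            sum_sq -= (int64_t)x[i - M] * x[i - M];
        }
        sum    += x[i];
        sum_sq += (int64_t)x[i] * x[i];
    }
}

int main(void)
{
    /* quiet signal with one burst of large oscillation */
    int32_t data[] = { 0,0,1,0,0,0,1,0,0,0, 40,-35,55,-45, 0,0,0,1,0,0 };
    detect(data, (int)(sizeof data / sizeof *data));
    return 0;
}

Scaling both sides of the test by M² is what lets (x − mean)² > K²·σ² be evaluated with integer sums only.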
Managed to get a working algorithm. Basically, determine the average difference between data points; if the data starts to exceed some multiple of that value for several consecutive samples, then most likely a spike is occurring.

Using multiple threads to calculate data, but it doesn't reduce the time

My CPU has four cores (macOS). I use 4 threads to calculate an array, but the calculation time is not reduced. Without multithreading the calculation takes about 52 seconds, but whether I use 4 threads or 2 threads, the time doesn't change.
(I know why this happens now. The problem is that I used clock() to measure the time. That is misleading in a multithreaded program, because clock() returns CPU time accumulated across all threads rather than wall-clock time. When I use time() to measure the time, the result is correct.)
The output of using 2 threads:
id 1 use time = 43 sec to finish
id 0 use time = 51 sec to finish
time for round 1 = 51 sec
id 1 use time = 44 sec to finish
id 0 use time = 52 sec to finish
time for round 2 = 52 sec
id 1 and id 0 are thread 1 and thread 0; "time for round" is the time for both threads to finish. If I don't use multithreading, the time for a round is also about 52 seconds.
This is the part that launches the 4 threads:
for(i=1;i<=round;i++)
{
    time_round_start=clock();
    for(j=0;j<THREAD_NUM;j++)
    {
        cal_arg[j].roundth=i;
        pthread_create(&thread_t_id[j], NULL, Multi_Calculate, &cal_arg[j]);
    }
    for(j=0;j<THREAD_NUM;j++)
    {
        pthread_join(thread_t_id[j], NULL);
    }
    time_round_end=clock();
    int round_time=(int)((time_round_end-time_round_start)/CLOCKS_PER_SEC);
    printf("time for round %d = %d sec\n",i,round_time);
}
This is the code inside the thread function:
void *Multi_Calculate(void *arg)
{
    struct multi_cal_data cal=*((struct multi_cal_data *)arg);
    int p_id=cal.thread_id;
    int i=0;
    int root_level=0;
    int leaf_addr=0;
    int neighbor_root_level=0;
    int neighbor_leaf_addr=0;
    Neighbor *locate_neighbor=(Neighbor *)malloc(sizeof(Neighbor));

    printf("id:%d, start:%d end:%d,round:%d\n",p_id,cal.start_num,cal.end_num,cal.roundth);
    for(i=cal.start_num;i<=cal.end_num;i++)
    {
        root_level=i/NUM_OF_EACH_LEVEL;
        leaf_addr=i%NUM_OF_EACH_LEVEL;
        if(root_addr[root_level][leaf_addr].node_value!=i)
        {
            //ignore, because this is a gap, no such node
        }
        else
        {
            int k=0;
            locate_neighbor=root_addr[root_level][leaf_addr].head;
            double tmp_credit=0;
            for(k=0;k<root_addr[root_level][leaf_addr].degree;k++)
            {
                neighbor_root_level=locate_neighbor->neighbor_value/NUM_OF_EACH_LEVEL;
                neighbor_leaf_addr=locate_neighbor->neighbor_value%NUM_OF_EACH_LEVEL;
                tmp_credit += root_addr[neighbor_root_level][neighbor_leaf_addr].g_credit[cal.roundth-1]/root_addr[neighbor_root_level][neighbor_leaf_addr].degree;
                locate_neighbor=locate_neighbor->next;
            }
            root_addr[root_level][leaf_addr].g_credit[cal.roundth]=tmp_credit;
        }
    }
    return 0;
}
The array is very large; each thread calculates part of the array.
Is there something wrong with my code?
It could be a bug, but if you are confident the code is correct, then the overhead of parallelization (thread creation, synchronization and such) might mean the overall runtime ends up the same as the non-parallelized code for the number of elements you are computing against.
It might be an interesting study to run both the single-threaded loop and the threaded code against very large arrays (100k elements?) and see whether the results start to diverge, with the parallel/threaded code coming out faster.
Amdahl's law, also known as Amdahl's argument,[1] is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
https://en.wikipedia.org/wiki/Amdahl%27s_law
You don't always gain speed by multi-threading a program. There is a certain amount of overhead that comes with threading. Unless there are enough inefficiencies in the non-threaded code to make up for that overhead, you'll not see an improvement. A lot can be learned about how multi-threading works even if the program you write ends up running slower.
I know why this happens now. The problem is that I used clock() to measure the time. That is misleading in a multithreaded program, because clock() returns CPU time accumulated across all threads rather than wall-clock time. When I use time() to measure the time, the result is correct.
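
To make the distinction concrete, here is a small sketch (my code, not the poster's): clock() accumulates CPU time across all threads of the process, while clock_gettime(CLOCK_MONOTONIC) gives wall-clock time, which is what matters when judging parallel speed-up:

#include <stdio.h>
#include <time.h>

/* Wall-clock elapsed seconds between two timespecs. */
static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec wall0, wall1;
    clock_t cpu0, cpu1;

    clock_gettime(CLOCK_MONOTONIC, &wall0);
    cpu0 = clock();

    /* ... run the threaded computation here ... */

    cpu1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &wall1);

    /* With N busy threads, CPU time is roughly N times the wall time. */
    printf("wall time: %.3f s\n", elapsed(wall0, wall1));
    printf("CPU time : %.3f s (summed over all threads)\n",
           (double)(cpu1 - cpu0) / CLOCKS_PER_SEC);
    return 0;
}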

L1 Cache Line Size

I am trying to determine the L1 cache line size with C code, on a platform where the L1 I-cache and D-cache are 32 KB each and the L2 cache is 2 MB.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define SIZE 100

long long wall_clock_time();

int main()
{
    int *arr = calloc(SIZE, sizeof(int));
    register int r, i;
    long long before, after;
    double time_elapsed;

    for (i = 0; i < SIZE; i++)
    {
        before = wall_clock_time();
        r = arr[i];
        after = wall_clock_time();
        time_elapsed = ((float)(after - before)) / 1000000000;
        printf("Element Index = %d, Time Taken = %1.4f\n", i, time_elapsed);
    }
    free(arr);
    return 0;
}

long long wall_clock_time() {
#ifdef __linux__
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)(tv.tv_usec * 1000 + (long long)tv.tv_sec * 1000000000ll);
#endif
}
Above is a small code snippet that I am using to access elements of an array, trying to spot the jump in access delay at cache line boundaries. However, when I execute the code, all the timing outputs come out as 0.000. I have read several threads on Stack Overflow about this topic but couldn't understand much, hence I attempted to write this code.
Can anybody explain to me whether there is an error here, conceptually or syntactically?
The 0.000 should have hinted that you're measuring something too small. The overhead of calling the measurement function is several orders of magnitude higher than what you are trying to measure.
Instead, measure the overall time it takes to pass over the array, and divide by SIZE to amortize it. Since SIZE is also rather small, you should probably repeat this several hundred times and amortize over the whole thing.
Note that this still won't give you the latency, but rather the throughput of the accesses. You'll need to come up with a way to measure the line size from that (try reading from the second-level cache, and use the fact that reads to the same line will hit in L1. By increasing your stride, you'll be able to see when your bandwidth stops degrading and stays constant).
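
One way to put that suggestion into code (a sketch under the stated assumptions, not a finished benchmark): allocate an array larger than the 32 KB L1 but smaller than the 2 MB L2, and time a full pass over it for increasing strides. The time per access grows while the stride is below the line size (fewer accesses share each fetched line) and then levels off once every access touches a new line:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (256 * 1024)   /* bigger than the 32 KB L1, smaller than the 2 MB L2 */
#define REPEATS     1000

int main(void)
{
    size_t n = ARRAY_BYTES;
    unsigned char *arr = malloc(n);
    volatile unsigned long sink = 0;

    if (!arr)
        return 1;
    for (size_t i = 0; i < n; i++)
        arr[i] = (unsigned char)i;

    /* For each stride, touch the array REPEATS times and report ns per touch. */
    for (size_t stride = 4; stride <= 512; stride *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPEATS; r++)
            for (size_t i = 0; i < n; i += stride)
                sink += arr[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        size_t touches = (size_t)REPEATS * (n / stride);
        printf("stride %4zu B : %.2f ns per access\n", stride, ns / touches);
    }

    free(arr);
    (void)sink;
    return 0;
}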

Calculation time elapsed by a particular function in C program

I have code in which I want to measure the time taken by two sorting algorithms, merge sort and quick sort, to sort N numbers, in microseconds or with more precision.
The two times thus calculated will then be output to the terminal.
Code (part of the code):
printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);
mergesort(extarr,0,n-1);
printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);
quicksort(arr,0,n-1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
Please help me with how this can be done. I have tried clock_t, taking two variables start and stop and placing them before and after the function call respectively, but this doesn't help at all and always prints the difference as zero.
Please suggest some other method or function, keeping in mind that it should have no problem running on any OS.
Thanks for any help in advance.
Method 1:
To calculate the total time taken by the program you can use the Linux utility "time".
Let's say your program is test.cpp.
$ g++ -o test test.cpp
$ time ./test
The output will look like:
real 0m11.418s
user 0m0.004s
sys 0m0.004s
Method 2:
You can also use the Linux profiling tool gprof to see the time spent in different functions.
First you have to compile the program with the "-pg" flag.
$ g++ -pg -o test test.cpp
$ ./test
$ gprof test gmon.out
PS: gmon.out is the default file created by gprof.
You can call the gettimeofday function on Linux and timeGetTime on Windows. Call the function before and after your sorting function and take the difference.
Please check the man pages for further details. If you still can't get tangible numbers (the time taken may be too small for small data sets), it is better to measure the time for 'n' iterations together and then deduce the time of a single run, or to increase the size of the data set being sorted.
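
A minimal Linux sketch of that approach (timeGetTime would play the analogous role on Windows); here the standard library's qsort stands in for whichever sort is being measured:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* comparison callback for qsort, used here as the function under test */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    enum { N = 100000 };
    int *arr = malloc(N * sizeof *arr);
    struct timeval before, after;

    if (!arr)
        return 1;
    for (int i = 0; i < N; i++)
        arr[i] = rand();

    gettimeofday(&before, NULL);
    qsort(arr, N, sizeof *arr, cmp_int);          /* the code being timed */
    gettimeofday(&after, NULL);

    long usec = (after.tv_sec - before.tv_sec) * 1000000L
              + (after.tv_usec - before.tv_usec);
    printf("qsort of %d ints took %ld microseconds\n", N, usec);

    free(arr);
    return 0;
}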
Not sure if you tried the following. I know your original post says that you have tried using CLOCKS_PER_SEC. Doing (stop - start) / CLOCKS_PER_SEC will give you seconds; using a double provides more precision.
#include <time.h>

int main()
{
    clock_t launch = clock();
    // do work
    clock_t done = clock();
    // cast before dividing, otherwise the integer division truncates to whole seconds
    double diff = (double)(done - launch) / CLOCKS_PER_SEC;
}
The reason you get zero as the result is likely the poor resolution of the time source you're using. These time sources typically increment in steps of some 10 to 20 ms. This is poor, but that's the way they work. When your sort completes in less than one such increment, the result will be zero. You may be able to increase the resolution into the 1 ms range by increasing your system's interrupt frequency; there is no standard way to accomplish this, and Windows and Linux each have their own way.
An even higher resolution can be obtained from a high-frequency counter. Windows and Linux both provide access to such counters, but again, the code looks slightly different on each.
If you need one piece of code to run on both Windows and Linux, I'd recommend performing the time measurement in a loop: run the code to be measured hundreds of times (or more) in a loop and capture the time outside the loop. Divide the captured time by the number of loop cycles and you have the result.
Of course, this is for evaluation only; you don't want that in the final code.
And: given that the timer resolution is in the 1 to 20 ms range, you should choose the total measurement time to get a decent resolution (hint: adjust the loop count so the loop runs for at least a second or so).
Example:
clock_t start, end;

printf("THE LIST BEFORE SORTING IS(UNSORTED LIST):\n");
printlist(arr,n);

start = clock();
for (int i = 0; i < 100; i++) {
    // note: after the first iteration extarr is already sorted, so copy the
    // unsorted input back in each time if every iteration should do equal work
    mergesort(extarr, 0, n - 1);
}
end = clock();
double diff = (double)(end - start) / CLOCKS_PER_SEC;
// and so on...

printf("THE LIST AFTER SORTING BY MERGE SORT IS(SORTED LIST):\n");
printlist(extarr,n);

quicksort(arr, 0, n - 1);
printf("THE LIST AFTER SORTING BY QUICK SORT IS(SORTED LIST):\n");
printlist(arr,n);
If you are on Linux 2.6.26 or above, then getrusage(2) is the most accurate way to go:
#include <sys/time.h>
#include <sys/resource.h>

// since Linux 2.6.26
// The macro is not defined in all headers, but supported if your Linux version matches
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif

// If you are single-threaded then RUSAGE_SELF is POSIX compliant
// http://linux.die.net/man/2/getrusage
struct rusage rusage_start, rusage_stop;

getrusage(RUSAGE_THREAD, &rusage_start);
...
getrusage(RUSAGE_THREAD, &rusage_stop);

// amount of microseconds spent in user space
size_t user_time = ((rusage_stop.ru_utime.tv_sec - rusage_start.ru_utime.tv_sec) * 1000000)
                   + rusage_stop.ru_utime.tv_usec - rusage_start.ru_utime.tv_usec;

// amount of microseconds spent in kernel space
size_t system_time = ((rusage_stop.ru_stime.tv_sec - rusage_start.ru_stime.tv_sec) * 1000000)
                     + rusage_stop.ru_stime.tv_usec - rusage_start.ru_stime.tv_usec;

OpenCL code runs faster on MBP than on NVIDIA GTX 480

I have come across a strange problem. I'm implementing some linear algebra in OpenCL, only matrix multiplication so far, and have been testing it on my laptop. The code is really simple:
__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
I test the hardware by running the code 100 times like this:
clock_t begin=clock();
const unsigned int repeats = 100;
for(int i = 0; i != repeats; i++){
runCL(a, b, results,N, N*N);
}
clock_t end=clock();
On my MBP the matrix multiplication takes about 1.2 ms on matrices of size 512*512, while the same code takes about 3 ms when running on a GTX 480 Linux box. This bothers me, since I wouldn't expect the expensive GTX card to be slower than the laptop.
As far as I can see, either my code is 'wrong' or I'm timing it in some wrong way.
I tried using the event-based timing system in the OpenCL spec; this gave somewhat more realistic results.
cl_event event = {0};
cl_int err;

err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);

err = clWaitForEvents(1, &event);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);

double executionTimeInMilliseconds = (end - start) * 1.0e-6f;
std::cout << "execution time in milis : " << executionTimeInMilliseconds << std::endl;
Now the GT330M does the operation in 46 ms and the GTX480 does it in 2.5 ms. This raises another really interesting question: with PROFILING turned on, the GT 330M becomes about 30 times slower, which sort of makes sense, but the GTX480 keeps the same performance. Can anyone explain why this is?
In timing the original problem, what you're seeing here is that with this naive code, the better specs of the GTX480 are actually hurting you.
The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of B, and those accesses can't be coalesced because of the stride.
The GTX480 has a 3x wider (384-bit) and 2x faster (1840 MHz) memory bus than the GT330M (128-bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4 GB/s vs 25.6 GB/s, and since this is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses are only using 32 bits of each 384-bit memory access, and in the 330M case only 32 bits out of each 128-bit access. So the effective memory bandwidths for the b accesses are 14.8 GB/s and 6.4 GB/s; now there's only a factor of 2 difference in total memory bandwidth rather than 7 or so, and much of the advantage of the faster card is being squandered. In addition, that memory bandwidth has to be divided among 10x as many cores, so the latency for each core to get its access and do the calculation is longer. I suspect that if you used larger matrix sizes, you could hide more of the latency and get closer to the best-possible 2x speedup rather than the 2.5x slowdown you're seeing.
The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.
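
For reference, a common shape for such a kernel (a sketch, not tuned for either card; it assumes N is a multiple of TILE and that the kernel is launched with a local work size of TILE x TILE) stages blocks of both operands in __local memory, so each global element is read once per tile rather than once per multiply-add:

#define TILE 16

__kernel void matrix_mult_tiled(__global const float* a,
                                __global const float* b,
                                __global float* c,
                                const int N)
{
    __local float a_tile[TILE][TILE];
    __local float b_tile[TILE][TILE];

    int row  = get_global_id(1);
    int col  = get_global_id(0);
    int lrow = get_local_id(1);
    int lcol = get_local_id(0);

    float sum = 0.0f;
    for (int t = 0; t < N / TILE; t++) {
        /* cooperatively load one TILE x TILE block of each operand */
        a_tile[lrow][lcol] = a[row * N + t * TILE + lcol];
        b_tile[lrow][lcol] = b[(t * TILE + lrow) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++)
            sum += a_tile[lrow][k] * b_tile[k][lcol];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    c[row * N + col] = sum;
}

On the host side, the clEnqueueNDRangeKernel call would then pass {TILE, TILE} as local_work_size instead of NULL.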
As for the profiling results you're seeing, I have no idea. Perhaps the 330M doesn't have as good hardware support for profiling, so things have to be implemented in software? Since the GTX numbers are about the same either way, I'd just use the simpler timing approach for now, which should be fine since you're not using asynchronous kernels or transfers.
I think you're pushing the limits on the timer resolution for Nvidia. Try clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it. With those tiny times I wouldn't really conclude anything.
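
For example, a minimal query (a sketch that just grabs the first platform's first GPU device; error handling omitted):

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    size_t timer_res_ns = 0;

    /* grab the first platform/GPU device; error handling omitted for brevity */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* resolution of the CL_PROFILING_* timestamps, in nanoseconds */
    clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                    sizeof(timer_res_ns), &timer_res_ns, NULL);
    printf("profiling timer resolution: %zu ns\n", timer_res_ns);
    return 0;
}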
A few ms could be the difference between initialization routines for each code path, especially when both testing systems have different hardware.
I recommend starting by testing a larger set which requires at least several seconds on both the laptop and the nVidia card.
