I am trying to understand the effects of caching programmatically using the following program. I am getting a segfault with the code. I used GDB (compiled with -g -O0) and found that it was segfaulting on
start = clock() (first occurrence)
Am I doing something wrong? The code looks fine to me. Can someone point out the mistake?
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#define MAX_SIZE (16*1024*1024)
int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int arr[MAX_SIZE];

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 2 */
    for (i = 0; i < MAX_SIZE; i += 16)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("CPU time for loop 2 %.6f secs.\n", cpu_time);

    return 0;
}
The array might be too big for the stack. Try making it static instead, so it goes into the global variable space. As an added bonus, static variables are initialized to all zero.
Unlike other kinds of storage, the compiler can check at compile time that resources exist for globals (and the OS can double-check at runtime before the program starts), so you don't need to handle out-of-memory errors. An uninitialized array won't make your executable file bigger.
This is an unfortunate rough edge of the way the stack works: it lives in a fixed-size buffer, set by the program executable's configuration according to the operating system, but its actual size is seldom checked against the available space.
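For example, a minimal sketch of that fix (only the declaration moves; the timing loops from the question are unchanged):
#include <stdio.h>
#include <time.h>

#define MAX_SIZE (16*1024*1024)

/* static: the 64 MB array now lives in zero-initialized global storage
   instead of on the stack, so it can no longer overflow the stack */
static int arr[MAX_SIZE];

int main()
{
    /* ... the same clock()-based timing loops as in the question ... */
    return 0;
}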
Welcome to Stack Overflow land!
Try to change:
int arr[MAX_SIZE];
to:
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
As Potatoswatter suggested, the array might be too big for the stack, so allocate it on the heap rather than on the stack.
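A slightly fuller sketch of the heap version (my own illustration; remember stdlib.h, the error check, and the free):
#include <stdio.h>
#include <stdlib.h>

#define MAX_SIZE (16*1024*1024)

int main()
{
    /* heap allocation: the big buffer no longer lives on the stack */
    int *arr = malloc(MAX_SIZE * sizeof(int));
    if (arr == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    /* ... the timing loops from the question go here ... */
    free(arr);
    return 0;
}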
I am trying to find the time taken by the memmove function in C using the time.h library. However, when I execute the code, I get the value zero. Any possible solution to find the time taken by the memmove function?
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

int main(void){
    uint64_t start, end;
    char source[5000];
    char dest[5000];
    uint64_t j = 0;
    for (j = 0; j < 5000; j++) {
        source[j] = j;
    }
    start = clock();
    memmove(dest, source, 5000);
    end = clock();
    printf("%f", ((double)end - start));
}
As I wrote in my comment, memmoving 5000 bytes is far too fast to be measurable with clock. If you do your memmove 100000 times, then it becomes measurable.
The code below gives an output of 12 on my computer. But this is platform dependent; the number you get on your computer might be quite different.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

int main(void) {
    uint64_t start, end;
    char source[5000];
    char dest[5000];
    uint64_t j = 0;
    for (j = 0; j < 5000; j++) {
        source[j] = j;
    }
    start = clock();
    for (int i = 0; i < 100000; i++)
    {
        memmove(dest, source, 5000);
    }
    end = clock();
    printf("%llu", (unsigned long long)(end - start)); /* no need to convert
                                   to double; (end - start) is a uint64_t */
}
If you want to know the time it takes on a BeagleBone or another device with a GPIO, you can toggle a GPIO before and after your routine. You will have to hook up an oscilloscope or something similar that can sample voltage quickly.
I don't know much about BeagleBones, but it seems the library libpruio allows fast GPIO toggling.
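For illustration, a hedged sketch using the generic Linux sysfs GPIO interface rather than libpruio (the pin number is an assumption, and the pin must already be exported and configured as an output):
#include <stdio.h>
#include <string.h>

/* write "1" or "0" to a sysfs GPIO value file */
static void gpio_write(const char *path, const char *v)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(v, f);
    fclose(f);
}

int main(void)
{
    const char *pin = "/sys/class/gpio/gpio60/value"; /* assumed pin */
    char source[5000], dest[5000];
    memset(source, 0xAA, sizeof source);

    gpio_write(pin, "1");               /* scope trigger: pin goes high */
    memmove(dest, source, sizeof dest); /* the routine under test */
    gpio_write(pin, "0");               /* scope trigger: pin goes low */
    return 0;
}
The oscilloscope then shows the width of the high pulse, which is the duration of the routine plus the (roughly constant) GPIO write overhead.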
Also, what is your exact goal here? To compare speed on different hardware? As someone suggested, you could increase the number of loops so it becomes more easily measurable with time.h.
I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice as much memory.
More specifically, I run over the 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.
operation            time (s)
-----------------------------
memset(a,0xff,LEN)   3.7
memcpy(a,b,LEN)      3.9
a[j] += b[j]         9.4
memcpy(a,b,LEN)      3.8
Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because it operates on three times as much data. However, it is only about 2.5 times as slow as memset.
Then I initialized b to zero with memset(b,0,LEN) and tested again:
operation            time (s)
-----------------------------
memcpy(a,b,LEN)      8.2
a[j] += b[j]         11.5
Now we see that memcpy is about twice as slow as memset, and a[j] += b[j] is about three times as slow as memset, as I expect.
At the very least I would have expected that, before memset(b,0,LEN), memcpy would be slower on the first of the 100 iterations because of lazy allocation (first touch).
Why do I only get the time I expect after memset(b,0,LEN)?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>
void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;

    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>
void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}
Compile with gcc -O3 test.c main.c (GCC 6.2). Clang 3.8 gives essentially the same result.
Test system: i7-6700HQ @ 2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN); i.e., I only see a problem on my Skylake system.
I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.
I came up with a simpler test:
#include <time.h>
#include <string.h>
#include <stdio.h>
void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
    for (int i = 0; i < 100; i++) for (int j = 0; j < LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}
This outputs:
9.472976
12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below), then it outputs:
12.5
12.5
This leads me to think this is an OS allocation issue and not a compiler issue.
#include <stdlib.h>
#include <string.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30; // 1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    // GCC optimizes memset(b,0,LEN) away after calloc, but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}
The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.
malloc etc. work by:
if the request can be fulfilled from the freelist, carve a chunk out of it
in the case of calloc: the equivalent of memset(ptr, 0, size) is issued
if not: ask the OS to extend the address space.
For systems with demand paging and copy-on-write (COW) (an MMU helps here), the second option winds down to:
create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
add these PTEs to the address space of the process
This consumes no physical memory, except for the page tables.
Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
But if the new page is written to, the COW logic kicks in (via a page fault):
physical memory is allocated
the /dev/zero page is copied to the new page
the new page is detached from the mother page
and the calling process can finally do the update which started all this
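A minimal sketch that makes this first-touch cost visible (my own illustration; assumes Linux with 4 KB pages, and the buffer size is arbitrary). The first write pass triggers the COW page faults; the second pass hits pages that are already mapped:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    const size_t LEN = (size_t)1 << 28; /* 256 MB */
    char *p = calloc(LEN, 1);
    if (!p) return 1;
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < LEN; i += 4096) p[i] = 1; /* first touch: faults */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (size_t i = 0; i < LEN; i += 4096) p[i] = 2; /* already mapped */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("first pass:  %f s\n", elapsed(t0, t1));
    printf("second pass: %f s\n", elapsed(t1, t2));
    free(p);
    return 0;
}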
Your b array probably was not written to after mmap-ing (huge allocation requests with malloc/calloc are usually converted into mmap). And the whole array was mapped to a single read-only "zero page" (part of the COW mechanism). Reading zeroes from a single page is faster than reading from many pages, as the single page will stay in the cache and in the TLB. This explains why the test before memset(0) was faster:
This outputs: 9.472976 12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below), then it outputs: 12.5 12.5
And more about GCC's malloc+memset / calloc+memset optimization into calloc (expanded from my comment):
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) on 2013-06-27 by Marc Glisse (https://stackoverflow.com/users/1918193) as planned for the 4.9/5.0 version of GCC:
memset(malloc(n),0,n) -> calloc(n,1)
calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).
Implemented on 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) in https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (see also https://patchwork.ozlabs.org/patch/325357/):
tree-ssa-strlen.c ...
(handle_builtin_malloc, handle_builtin_memset): New functions.
The current code is in gcc/tree-ssa-strlen.c (https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889): if the pointer passed to memset(0) came from malloc or calloc, the malloc is converted into calloc and the memset(0) is then removed:
/* Handle a call to memset.
   After a call to calloc, memset(,0,) is unnecessary.
   memset(malloc(n),0,n) is calloc(n,1).  */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
...
  if (code1 == BUILT_IN_CALLOC)
    /* Not touching stmt1 */ ;
  else if (code1 == BUILT_IN_MALLOC
           && operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
    {
      gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
      update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
                          size, build_one_cst (size_type_node));
      si1->length = build_int_cst (size_type_node, 0);
      si1->stmt = gsi_stmt (gsi1);
    }
This was discussed on the gcc-patches mailing list from Mar 1, 2014 to Jul 15, 2014 under the subject "calloc = malloc + memset":
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01693.html
https://gcc.gnu.org/ml/gcc-patches/2014-03/threads.html#00009
https://gcc.gnu.org/ml/gcc-patches/2014-04/threads.html#00817
https://gcc.gnu.org/ml/gcc-patches/2014-05/msg01392.html
https://gcc.gnu.org/ml/gcc-patches/2014-06/threads.html#00234
https://gcc.gnu.org/ml/gcc-patches/2014-07/threads.html#01059
with a notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html
FWIW I believe the transformation will break a large variety of micro benchmarks.
calloc internally knows that memory fresh from the OS is zeroed. But the memory may not be faulted in yet.
memset always faults in the memory.
So if you have some test like
buf = malloc(...)
memset(buf, ...)
start = get_time();
... do something with buf
end = get_time()
Now the times will be completely off because the measured times include the page faults.
Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."
Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html
Any of these flags works:
-fdisable-tree-strlen
-fno-builtin-malloc
-fno-builtin-memset (assuming you wrote 'memset' explicitly in your code)
-fno-builtin
-ffreestanding
-O1
-Os
In the code, you can hide that the pointer passed to memset is the
one returned by malloc by storing it in a volatile variable, or
any other trick to hide from the compiler that we are doing
memset(malloc(n),0,n).
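For instance, a minimal sketch of that volatile trick (my own illustration, not code from the mailing list):
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = (size_t)1 << 20;
    void *p = malloc(n);
    if (!p) return 1;
    /* Launder the pointer through a volatile object so the compiler can no
       longer prove the memset target came straight from malloc. */
    void * volatile hidden = p;
    memset(hidden, 0, n); /* stays a real memset and faults the pages in */
    free(p);
    return 0;
}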
I wanted to calculate the difference in execution time when executing the same code inside a function. To my surprise, however, sometimes the clock difference is 0 when I use clock()/clock_t for the start and stop timers. Does this mean that clock()/clock_t does not actually return the number of clicks the processor spent on the task?
After a bit of searching, it seemed to me that clock_gettime() would return more fine-grained results. And indeed it does, but I instead end up with an arbitrary number of nano(?)seconds. It gives a hint of the difference in execution time, but it's hardly accurate as to exactly how many clicks of difference it amounts to. What would I have to do to find this out?
#include <math.h>
#include <stdio.h>
#include <time.h>
#define M_PI_DOUBLE (M_PI * 2)
void rotatetest(const float *x, const float *c, float *result) {
    float rotationfraction = *x / *c;
    *result = M_PI_DOUBLE * rotationfraction;
}

int main() {
    int i;
    long test_total = 0;
    int test_count = 1000000;
    struct timespec test_time_begin;
    struct timespec test_time_end;

    float r = 50.f;
    float c = 2 * M_PI * r;
    float x = 3.f;
    float result_inline = 0.f;
    float result_function = 0.f;

    for (i = 0; i < test_count; i++) {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
        float rotationfraction = x / c;
        result_inline = M_PI_DOUBLE * rotationfraction;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
        test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
    }
    printf("Inline clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count, result_inline);

    test_total = 0; /* reset so the second measurement is independent */
    for (i = 0; i < test_count; i++) {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
        rotatetest(&x, &c, &result_function);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
        test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
    }
    printf("Function clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count, result_function);
    return 0;
}
I am using gcc version 4.8.4 on Linux 3.13.0-37-generic (Linux Mint 16)
First of all: as already mentioned in the comments, timing a single run of an operation will probably do you no good. In the worst case, the call for getting the time can take longer than the actual execution of the operation.
Please clock multiple runs of the operation (including a warm-up phase so everything is swapped in) and calculate the average running time.
clock() isn't guaranteed to be monotonic. It also isn't the number of processor clicks (however you define those) the program has run. The best way to describe the result from clock() is probably "a best-effort estimate of the time any one of the CPUs has spent on calculation for the current process". For benchmarking purposes, clock() is thus mostly useless.
As per specification:
The clock() function returns the implementation's best approximation to the processor time used by the process since the beginning of an implementation-dependent time related only to the process invocation.
And additionally
To determine the time in seconds, the value returned by clock() should be divided by the value of the macro CLOCKS_PER_SEC.
So, if you call clock() more often than its resolution allows, you are out of luck.
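A quick sketch (my own illustration) that estimates this resolution empirically by spinning until clock() returns a new value:
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t t0 = clock(), t1;
    while ((t1 = clock()) == t0)
        ; /* busy-wait until the next observable tick */
    printf("clock() advanced by %ld ticks (%f s)\n",
           (long)(t1 - t0), (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}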
For profiling/benchmarking, you should, if possible, use one of the performance clocks that are available on modern hardware. The prime candidates are probably:
The HPET
The TSC
Edit: the question now references CLOCK_PROCESS_CPUTIME_ID, which is Linux's way of exposing the TSC.
Whether either (or both) is available depends on the hardware and is also operating-system specific.
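For illustration, a hedged x86-only sketch that reads the TSC directly through the GCC/Clang intrinsic __rdtsc() from <x86intrin.h> (assumes an invariant TSC and ignores core migration; the loop is just a placeholder workload):
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long t0 = __rdtsc();
    volatile double x = 0.0; /* volatile so the loop isn't optimized away */
    for (int i = 0; i < 1000000; i++)
        x += 1.0; /* placeholder workload */
    unsigned long long t1 = __rdtsc();
    printf("elapsed: %llu reference cycles\n", t1 - t0);
    return 0;
}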
After googling a little bit, I can see that the clock() function can be used as a standard mechanism to find the time taken for execution. Be aware that the time will vary depending on the load of your processor.
You can use the code below for the calculation:
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
I have a simple function which takes random words and puts them in lexicographical order using the insertion sort algorithm. I have no problem with the function (it works; it's tested), but when I try to measure the execution time of the function using two different clock() values, I get the same values before and after the execution of the function, so it shows 0 as the elapsed time:
clock_t t1 = clock();
InsertionSort(data, n);
clock_t t2 = clock();

/*
 * Display the results.
 */
for (size = i, i = 0; i < size; ++i)
{
    printf("data[%d] = \"%s\"\n", (int)i, data[i]);
}

/*
 * Display the execution time
 */
printf("The time taken is.. %g ", (t2 -t1));
The time difference is too small to be measured by this method, without adding more code to execute. – Weather Vane
Usually, you contrive a way to measure a large number of loops of what you want to time: 10, 100, 1000, whatever produces a significant result. Bear in mind too that on a multi-tasking OS each iteration will take a slightly different time, so you'll also establish a typical average. The result might also be affected by processor caching and/or file caching. – Weather Vane
Try like this:
#include <sys/types.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
double gettime(void)
{
    struct timeval tv;
    double d = 0;

    gettimeofday(&tv, NULL);
    d = tv.tv_sec;
    d += (double)tv.tv_usec / 1000000.;
    return d;
}
double t1=gettime();
InsertionSort(data, n);
printf("%.6f", gettime() - t1);
or maybe you need to change your code like this:
clock_t t1 = clock();
InsertionSort(data, n);
clock_t t2 = clock();
double d = (double)(t2 - t1) / CLOCKS_PER_SEC;
You can also refer: Easily measure elapsed time
You are incorrectly using the floating-point format specifier %g. Try this instead:
printf("The time taken is.. %u clock ticks", (unsigned)(t2 -t1));
Always assuming the execution time is longer than the granularity of clock().
I'm trying to measure differences in speed of reading and writing misaligned vs. aligned bits into binary files. I would like to know: is there a utility I can use (except for running time over and over again and writing my own) to sample the average run time of a program? (I'm running a Linux-based OS.)
Thanks
running time over and over again and writing my own
That's fine. You can perform the read/write ten thousand times both ways and compute the average time.
If you really want to use a library you can try Google Perftools.
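A minimal sketch of the do-it-yourself route (my own illustration; the file name, buffer size, and repetition count are arbitrary):
#include <stdio.h>
#include <time.h>

#define REPS 10000

int main(void)
{
    static char buf[4096];
    clock_t t0 = clock();
    for (int i = 0; i < REPS; i++) {
        FILE *f = fopen("bench.bin", "wb");
        if (!f) return 1;
        fwrite(buf, 1, sizeof buf, f);
        fclose(f);
    }
    clock_t t1 = clock();
    printf("avg: %f s per write\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC / REPS);
    remove("bench.bin"); /* clean up the scratch file */
    return 0;
}
Run the same harness once with aligned writes and once with misaligned ones, and compare the two averages.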
Put this in a header file:
#ifndef TIMER_H
#define TIMER_H

#include <stdlib.h>
#include <sys/time.h>

typedef unsigned long long timestamp_t;

static timestamp_t
get_timestamp ()
{
    struct timeval now;
    gettimeofday (&now, NULL);
    return now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

#endif
Include the header in whichever .c file you'll be using, and do something like this:
#include <stdio.h>
#include "timer.h" /* the header above */

#define N 10000

int main()
{
    int i;
    double avg;
    timestamp_t start, end;

    start = get_timestamp();
    for (i = 0; i < N; i++)
        foo(); /* the function being measured */
    end = get_timestamp();

    avg = (end - start) / (double)N;
    printf("%f", avg);
    return 0;
}
Basically, this calls whichever function you're trying to measure the performance of N times, where N is, in this case, a defined constant (it doesn't have to be). It takes a timestamp before the for loop and one after, then calculates the average time it took for the function to execute. get_timestamp() returns the number of microseconds, so if you need milliseconds, divide by 1,000; for seconds, divide by 1,000,000; and so on.