Time larger than the limit fixed in the CPLEX routines - C

I have this in main():
#define N 23
start_time = clock();
readData(c);   // just reads a matrix of integers of size N (in this case 23*23)
lp(c, d);      // solves it with CPLEX, using a CPLEX time-limit setting of 1 h 30 min
final_time = clock();
time = (final_time - start_time) * 0.001;
printf("\n CPU = %f sec\n\n", time);
The output shows:
Default row names c1, c2 ... being created.
solution status is Feasible
obj. value: 5557
gap : 1.1697
CPU = 10800.494141 sec
Why is the time so large? Did main() spend another hour just reading a 23*23 matrix?!

From the documentation of clock():
"The value returned is the CPU time used so far as a clock_t; to get the number of seconds used, divide by CLOCKS_PER_SEC."
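A minimal sketch of the usual pattern (standard C only; readData and lp stand in for the calls from the question):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock();

    /* work to be timed, e.g. readData(c); lp(c, d); */

    clock_t end = clock();
    double cpu_seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("CPU = %f sec\n", cpu_seconds);
    return 0;
}

Two things follow from this: on many systems CLOCKS_PER_SEC is 1000000 rather than 1000, so multiplying the tick difference by 0.001 inflates the result; and clock() reports CPU time accumulated by the whole process (all threads, on most implementations), so a multi-threaded CPLEX run can legitimately report more CPU seconds than the 1 h 30 min wall-clock limit.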

Related

Nearly identical code with different running time - Why?

I'm testing two nearly identical programs with a tiny difference in one of the for loops. The first uses three nested loops over the indexes y, z, x, while the second iterates over x, z, y.
My question is: why the difference in user time and wall-clock time? Is it because of the memory access pattern in one program versus the other?
test_1.c:
#include <stdlib.h>

#define N 1000

// Matrix definitions
long long int A[N][N], B[N][N], R[N][N];

int main()
{
    int x, y, z;
    char str[100];

    /* Matrix initialization */
    for (y = 0; y < N; y++)
        for (x = 0; x < N; x++)
        {
            A[y][x] = x;
            B[y][x] = y;
            R[y][x] = 0;
        }

    /* Matrix multiplication */
    for (y = 0; y < N; y++)
        for (z = 0; z < N; z++)
            for (x = 0; x < N; x++)
            {
                R[y][x] += A[y][z] * B[z][x];
            }

    exit(0);
}
The difference in the second program (test_2.c) is in the last for loop:
for (x = 0; x < N; x++)
    for (z = 0; z < N; z++)
        for (y = 0; y < N; y++)
        {
            R[y][x] += A[y][z] * B[z][x];
        }
If I run /usr/bin/time -v ./test_1 I get the following stats:
Command being timed: "./test_1"
User time (seconds): 5.19
System time (seconds): 0.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.22
While /usr/bin/time -v ./test_2 gives the following stats:
Command being timed: "./test_2"
User time (seconds): 7.75
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.76
Basically, you're accessing the memory in a different pattern - your first approach is much more friendly to the memory cache, because you're accessing a lot of data in the same area, then moving on to the next piece of memory, etc.
If you want a real world analogy, imagine you're delivering leaflets to 10 different roads (A-J), each of which has house numbers 1-10. You could deliver A1, A2, A3...A10, B1, B2, B3...B10 etc... or you could deliver A1, B1, C1...J1, A2, B2, C2... etc. Clearly, the first way is going to be more efficient. It's just like that in computer memory - it's more efficient to access memory "near" memory you've recently accessed than it is to hop around.
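To make the analogy concrete in the code itself, here is the same pair of loops from the question, annotated with what they do to memory (nothing new beyond the comments; N, A, B and R are the arrays defined above):

/* test_1.c ordering: x is innermost, so R[y][x] and B[z][x] walk
   contiguous memory in C's row-major layout; consecutive iterations
   reuse the cache lines that were just loaded. */
for (y = 0; y < N; y++)
    for (z = 0; z < N; z++)
        for (x = 0; x < N; x++)
            R[y][x] += A[y][z] * B[z][x];

/* test_2.c ordering: y is innermost, so R[y][x] and A[y][z] jump by
   N * sizeof(long long int) bytes every iteration; each access lands
   on a different cache line, hence the extra ~2.5 seconds above. */
for (x = 0; x < N; x++)
    for (z = 0; z < N; z++)
        for (y = 0; y < N; y++)
            R[y][x] += A[y][z] * B[z][x];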

"-Nan" value for the total sum of array elements with GPU code

I am working on an OpenCL code which computes the sum of the elements of an array. Everything works fine up to a size of 1.024 * 1e+8 for the 1D input array, but with 1.024 * 1e+9 the final value is "-nan".
Here's the source of the code on this link
The Kernel code is on this link
and the Makefile on this link
Here's the result for the largest array size that works (1.024 * 1e+8):
$ ./sumReductionGPU 102400000
Max WorkGroup size = 4100
Number of WorkGroups = 800000
Problem size = 102400000
Final Sum Sequential = 5.2428800512000000000e+15
Final Sum GPU = 5.2428800512000000000e+15
Initializing Arrays : Wall Clock = 0 second 673785 micro
Preparing GPU/OpenCL : Wall Clock = 1 second 925451 micro
Time for one NDRangeKernel call and WorkGroups final Sum : Wall Clock = 0 second 30511 micro
Time for Sequential Sum computing : Wall Clock = 0 second 398485 micro
I have taken local_item_size = 128, so, as indicated above, I have 800000 work-groups for NWorkItems = 1.024 * 1e+8.
Now if I take 1.024 * 1e+9, the partial sums are no longer computed and I get a "-nan" value for the total sum of the array elements.
$ ./sumReductionGPU 1024000000
Max WorkGroup size = 4100
Number of WorkGroups = 8000000
Problem size = 1024000000
Final Sum Sequential = 5.2428800006710899200e+17
Final Sum GPU = -nan
Initializing Arrays : Wall Clock = 24 second 360088 micro
Preparing GPU/OpenCL : Wall Clock = 19 second 494640 micro
Time for one NDRangeKernel call and WorkGroups final Sum : Wall Clock = 0 second 481910 micro
Time for Sequential Sum computing : Wall Clock = 166 second 214384 micro
Maybe I have reached the limit of what the GPU can compute, but I would like your advice to confirm this.
If a double is 8 bytes, this requires 1.024 * 1e9 * 8 ~ 8 GB for the input array: isn't that too much? I have only 8 GB of RAM.
From your experience, where could this issue come from?
Thanks
As you already found out, your 1D input array requires a lot of memory, so the memory allocations with malloc or clCreateBuffer are prone to fail.
For malloc, I suggest using a helper function checked_malloc which detects a failed memory allocation, prints a message and exits the program.
#include <stdlib.h>
#include <stdio.h>

void *checked_malloc(size_t size, const char purpose[])
{
    void *result = malloc(size);
    if (result == NULL) {
        fprintf(stderr, "ERROR: malloc failed for %s\n", purpose);
        exit(1);
    }
    return result;
}

int main()
{
    double *p1 = checked_malloc(1e8 * sizeof *p1, "array1");
    double *p2 = checked_malloc(64 * 1e9 * sizeof *p2, "array2");
    return 0;
}
On my PC, which has only 48 GB of virtual memory, the second allocation fails and the program prints:
ERROR: malloc failed for array2
You can apply the same scheme to clCreateBuffer. But you have to check the result of every OpenCL call anyway, so I recommend using a macro for this:
#define CHECK_CL_ERROR(result) if (result != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL call failed at: %s:%d with code %d\n", \
            __FILE__, __LINE__, result); }
An example usage would be:
cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
nWorkItems * sizeof(double), NULL, &ret);
CHECK_CL_ERROR(ret);
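Beyond checking return codes, it can also help to compare the requested buffer size against the device limits before calling clCreateBuffer. The following is only a sketch: it assumes device is the cl_device_id already selected in the host code and nWorkItems is the array length, and it relies on the fact that CL_DEVICE_MAX_MEM_ALLOC_SIZE is usually only a fraction of total device memory, so an ~8 GB buffer is very likely to be rejected.

cl_ulong max_alloc = 0, global_mem = 0;
cl_int err;

/* Query the largest single allocation and the total global memory. */
err = clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                      sizeof(max_alloc), &max_alloc, NULL);
CHECK_CL_ERROR(err);
err = clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                      sizeof(global_mem), &global_mem, NULL);
CHECK_CL_ERROR(err);

size_t requested = (size_t)nWorkItems * sizeof(double);
if (requested > max_alloc || requested > global_mem) {
    fprintf(stderr, "Buffer of %zu bytes exceeds device limits "
                    "(max alloc %llu, global mem %llu)\n",
            requested, (unsigned long long)max_alloc,
            (unsigned long long)global_mem);
    exit(1);
}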

Using multiple threads to calculate data but it doesn't reduce the time

My CPU has four cores (macOS). I use 4 threads to calculate an array, but the calculation time is not reduced. Without multithreading the calculation takes about 52 seconds, but even if I use 4 threads, or 2 threads, the time doesn't change.
(I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() accumulates the CPU time of all threads, so it scales with the number of threads. When I use time() to measure the time, the result is correct.)
The output of using 2 threads:
id 1 use time = 43 sec to finish
id 0 use time = 51 sec to finish
time for round 1 = 51 sec
id 1 use time = 44 sec to finish
id 0 use time = 52 sec to finish
time for round 2 = 52 sec
id 1 and id 0 are thread 1 and thread 0; "time for round" is the time for both threads to finish. Without multithreading, the time for a round is also about 52 seconds.
This is the part that creates the threads:
for (i = 1; i <= round; i++)
{
    time_round_start = clock();
    for (j = 0; j < THREAD_NUM; j++)
    {
        cal_arg[j].roundth = i;
        pthread_create(&thread_t_id[j], NULL, Multi_Calculate, &cal_arg[j]);
    }
    for (j = 0; j < THREAD_NUM; j++)
    {
        pthread_join(thread_t_id[j], NULL);
    }
    time_round_end = clock();
    int round_time = (int)((time_round_end - time_round_start) / CLOCKS_PER_SEC);
    printf("time for round %d = %d sec\n", i, round_time);
}
This is the code inside the thread function:
void *Multi_Calculate(void *arg)
{
    struct multi_cal_data cal = *((struct multi_cal_data *)arg);
    int p_id = cal.thread_id;
    int i = 0;
    int root_level = 0;
    int leaf_addr = 0;
    int neighbor_root_level = 0;
    int neighbor_leaf_addr = 0;
    Neighbor *locate_neighbor = (Neighbor *)malloc(sizeof(Neighbor));

    printf("id:%d, start:%d end:%d,round:%d\n", p_id, cal.start_num, cal.end_num, cal.roundth);

    for (i = cal.start_num; i <= cal.end_num; i++)
    {
        root_level = i / NUM_OF_EACH_LEVEL;
        leaf_addr = i % NUM_OF_EACH_LEVEL;
        if (root_addr[root_level][leaf_addr].node_value != i)
        {
            // ignore, because this is a gap; there is no such node
        }
        else
        {
            int k = 0;
            locate_neighbor = root_addr[root_level][leaf_addr].head;
            double tmp_credit = 0;
            for (k = 0; k < root_addr[root_level][leaf_addr].degree; k++)
            {
                neighbor_root_level = locate_neighbor->neighbor_value / NUM_OF_EACH_LEVEL;
                neighbor_leaf_addr = locate_neighbor->neighbor_value % NUM_OF_EACH_LEVEL;
                tmp_credit += root_addr[neighbor_root_level][neighbor_leaf_addr].g_credit[cal.roundth - 1]
                            / root_addr[neighbor_root_level][neighbor_leaf_addr].degree;
                locate_neighbor = locate_neighbor->next;
            }
            root_addr[root_level][leaf_addr].g_credit[cal.roundth] = tmp_credit;
        }
    }
    return 0;
}
The array is very large; each thread calculates part of it.
Is there something wrong with my code?
It could be a bug, but if you feel the code is correct, then the overhead of parallelization (thread creation, synchronization and so on) might mean the overall runtime ends up about the same as for the non-parallelized code at this problem size.
It might be an interesting experiment to run the looped, single-threaded code and the threaded code against very large arrays (100k elements?) and see whether the results start to diverge, with the parallel/threaded code becoming faster.
Amdahl's law, also known as Amdahl's argument,[1] is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
https://en.wikipedia.org/wiki/Amdahl%27s_law
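For reference, the law itself is easy to evaluate; here is a small sketch (p is the fraction of the run that can be parallelized, n the number of processors):

/* Theoretical maximum speedup according to Amdahl's law. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
/* e.g. p = 0.95, n = 4 gives about 3.48x, never the naive 4x */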
You don't always gain speed by multi-threading a program. There is a certain amount of overhead that comes with threading. Unless there are enough inefficiencies in the non-threaded code to make up for the overhead, you'll not see an improvement. A lot can be learned about how multi-threading works even if the program you write ends up running slower.
I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() accumulates the CPU time of all threads, so it scales with the number of threads. When I use time() to measure the time, the result is correct.
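A minimal sketch of the difference, assuming a POSIX system with clock_gettime (available on Linux and on macOS 10.12 and later): clock() accumulates CPU time over all threads, while CLOCK_MONOTONIC measures elapsed wall-clock time, which is the number that should shrink when the work is spread across cores.

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec w0, w1;
    clock_t c0 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w0);

    /* ... create the worker threads and join them here ... */

    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_t c1 = clock();

    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    printf("CPU time: %.3f sec, wall time: %.3f sec\n", cpu, wall);
    return 0;
}

With two busy threads the CPU figure will be roughly twice the wall figure, which is why the clock()-based round times above never appear to get faster.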

Trying to get the time of an operation and receiving time 0 seconds

I am trying to see how long it takes to insert about 10000 names into a BST (written in C).
I read these names from a txt file using fscanf. I have declared a file pointer (fp) in the main function and call a function located in another .c file, passing fp through its arguments. I want to measure the time needed for 2, 4, 8, 16, 32, ..., 8192 names to be inserted, saving the times in a long double array. I have included the time.h header in the .c file where the function is located.
Code:
void myfunct(BulkTreePtr *Bulktree, FILE *fp, long double time[])
{
    double tstart, tend, ttemp;
    TStoixeioyTree datainput;
    int error = 0, counter = 0, index = 0, num = 2, i;

    tstart = ((double)clock()) / CLOCKS_PER_SEC;
    while (!feof(fp))
    {
        counter++;
        fscanf(fp, "%s %s", datainput.lname, datainput.fname);
        Tree_input(&((*Bulktree)->TreeRoot), datainput, &error);
        if (counter == num)
        {
            ttemp = (double)clock() / CLOCKS_PER_SEC;
            time[index] = ttemp - tstart;
            num = num * 2;
            index++;
        }
    }
    tend = ((double)clock()) / CLOCKS_PER_SEC;
    printf("Last value of ttemp is %f\n", ttemp - tstart);
    time[index] = (tend - tstart);

    num = 2;
    for (i = 0; i < 14; i++)
    {
        printf("Time after %d names is %f sec \n", num, (float)time[i]);
        num = num * 2;
    }
}
I am getting this:
Last value of ttemp is 0.000000
Time after 2 names is 0.000000 sec
Time after 4 names is 0.000000 sec
Time after 8 names is 0.000000 sec
Time after 16 names is 0.000000 sec
Time after 32 names is 0.000000 sec
Time after 64 names is 0.000000 sec
Time after 128 names is 0.000000 sec
Time after 256 names is 0.000000 sec
Time after 512 names is 0.000000 sec
Time after 1024 names is 0.000000 sec
Time after 2048 names is 0.000000 sec
Time after 4096 names is 0.000000 sec
Time after 8192 names is 0.000000 sec
Time after 16384 names is 0.010000 sec
What am I doing wrong? :S
Use clock_getres() and clock_gettime(). Most likely you will find your system doesn't have a very fine-grained clock. Note that the system might return different numbers when calling gettimeofday() or clock_gettime(), but often (depending on the kernel) the digits beyond HZ resolution are lies generated to simulate time advancing.
You might find fixed-time tests better: find out how many inserts you can do in 10 seconds, or use some kind of fast reset method (memset?) and find out how many batches of 1024 insertions you can do in 10 seconds.
[EDIT]
Traditionally, the kernel gets interrupted at frequency HZ by the hardware. Only when it gets this hardware interrupt does it know that time has advanced by 1/HZ of a second. The traditional HZ value was 100, i.e. a tick every 1/100 of a second. Surprise, surprise: you saw a 1/100th of a second increment in time. Some systems and kernels have recently started providing other methods of getting higher-resolution time, by looking at the RTC device or similar.
However, you should use the clock_gettime() function I pointed you to, along with clock_getres(), to find out how often you will get accurate time updates. Make sure your test runs over many multiples of clock_getres(), unless you want it to be a total lie.
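A minimal sketch of that suggestion, assuming a POSIX system (very old glibc versions may need -lrt at link time):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res, t0, t1;

    /* How fine-grained this clock actually is. */
    clock_getres(CLOCK_MONOTONIC, &res);
    printf("clock resolution: %ld ns\n", res.tv_nsec);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... insert the names into the BST here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.9f sec\n", elapsed);
    return 0;
}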
clock() returns the number of "ticks"; there are CLOCKS_PER_SEC ticks per second. For any operation which takes less than 1/CLOCKS_PER_SEC seconds, the return value of clock() will either be unchanged or changed by 1 tick.
From your results it looks like even 16384 insertions take no more than 1/100 seconds.
If you want to know how long a certain number of insertions take, try repeating them many, many times so that the total number of ticks is significant, and then divide that total time by the number of times they were repeated.
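A rough sketch of that repeat-and-divide approach; reset_tree and insert_batch are placeholders for whatever empties the BST and inserts one batch of names, not functions from the question:

#include <stdio.h>
#include <time.h>

#define REPS 1000   /* enough repetitions that the total spans many clock ticks */

int main(void)
{
    clock_t start = clock();
    for (int r = 0; r < REPS; r++) {
        /* reset_tree();     placeholder: empty the BST              */
        /* insert_batch();   placeholder: insert one batch of names  */
    }
    clock_t end = clock();

    double per_batch = (double)(end - start) / CLOCKS_PER_SEC / REPS;
    printf("average time per batch: %.6f sec\n", per_batch);
    return 0;
}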
clock returns the amount of CPU time used, not the amount of actual time elapsed, but that might be what you want here. Note that the Unix standard requires CLOCKS_PER_SEC to be exactly one million (1000000), but the resolution can be much worse (e.g. it might jump by 10000 at a time). You should be using clock_gettime with the CPU-time clock if you want to measure CPU time spent, or otherwise with the monotonic clock to measure real time spent.
ImageMagick includes stopwatch functions such as these.
#include "magick/MagickCore.h"
#include "magick/timer.h"
TimerInfo *timer_info;
timer_info = AcquireTimerInfo();
<your code>
printf("elapsed=%.6f sec", GetElapsedTime(timer_info));
But that only seems to have a resolution of 1/100 second, and it requires installing ImageMagick. I suggest this instead: it's simple and has microsecond resolution on Linux.
#include <sys/time.h>   /* gettimeofday() lives here, not in <time.h> */

double timer(double start_secs)
{
    static struct timeval tv;
    static struct timezone tz;
    gettimeofday(&tv, &tz);
    double now_secs = (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
    return now_secs - start_secs;
}

double t1 = timer(0);
<your code>
printf("elapsed=%.6f sec", timer(t1));

UTC time stamp on Windows

I have a buffer with the UTC time stamp in C, and I broadcast that buffer every ten seconds. The problem is that the time difference between two packets is not consistent: after 5 to 10 iterations the difference becomes 9, then 11, and then 10 again. Kindly help me sort out this problem.
I am using <time.h> for UTC time.
If your time stamp has only 1 second resolution then there will always be +/- 1 uncertainty in the least significant digit (i.e. +/- 1 second in this case).
Clarification: if you only have a resolution of 1 second then your time values are quantized. The real time, t, represented by such a quantized value has a range of t..t+0.9999. If you take the difference of two such times, t0 and t1, then the maximum error in t1-t0 is -0.999..+0.999, which when quantized is +/-1 second. So in your case you would expect to see difference values in the range 9..11 seconds.
A thread that sleeps for X milliseconds is not guaranteed to sleep for precisely that many milliseconds. I am assuming that you have a statement that goes something like:
while (1) {
    ...
    sleep(10); // Sleep for 10 seconds.
    // fetch timestamp and send
}
You will get a more accurate gauge of time if you sleep for shorter periods (say 20 milliseconds) in a loop checking until the time has expired. When you sleep for 10 seconds, your thread gets moved further out of the immediate scheduling priority of the underlying OS.
You might also take into account that the time taken to send the timestamps may vary, depending on network conditions, etc, if you do a sleep(10) -> send ->sleep(10) type of loop, the time taken to send will be added onto the next sleep(10) in real terms.
Try something like this (forgive me, my C is a little rusty):
bool expired = false;
double last, current;
double t1, t2;
double difference = 0;

/* Note: this relies on clock() advancing during sleep. On Windows, clock()
   measures wall-clock time since the process started, so this works; on
   POSIX systems clock() measures CPU time and would barely advance while
   the thread sleeps. */
while (1) {
    ...
    last = (double)clock();
    while (!expired) {
        usleep(20000); // sleep for 20 milliseconds
        current = (double)clock();
        if (((current - last) / (double)CLOCKS_PER_SEC) >= (10.0 - difference))
            expired = true;
    }
    t1 = (double)clock();
    // Set and send the timestamp.
    t2 = (double)clock();
    //
    // Calculate how long it took to send the stamps
    // and subtract that from the next sleep cycle.
    //
    difference = (t2 - t1) / (double)CLOCKS_PER_SEC;
    expired = false;
}
If you are not bothered about sticking to the standard C library, you could look at the high-resolution timer functionality of Windows, such as the QueryPerformanceFrequency/QueryPerformanceCounter functions.
#include <windows.h>

LARGE_INTEGER freq;
LARGE_INTEGER t1, t2;

//
// Get the resolution of the timer.
//
QueryPerformanceFrequency(&freq);

// Start task.
QueryPerformanceCounter(&t1);
... Do something ....
QueryPerformanceCounter(&t2);

// Very accurate duration in seconds.
double duration = (double)(t2.QuadPart - t1.QuadPart) / (double)freq.QuadPart;
