I am writing code to measure how long a sequence of code takes to run in the kernel, by loading it into the kernel as a module. I use the common rdtsc routine to measure the time. Interestingly, a similar routine running in user mode gives normal values, whereas the result is always 0 in kernel mode, no matter how many lines of code I add to the time_count function. The calculation I use here is an ordinary matrix product function, and its cycle count should increase rapidly as the matrix dimension grows. Can anyone point out the mistake in my code that prevents me from measuring the cycle count in the kernel?
#include <linux/init.h>
#include <linux/module.h>
int matrix_product(){
int array1[500][500], array2[500][500], array3[500][500];
int i, j, k, sum;
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
array1[i][j] = 5*i + j;
array2[i][j] = 5*i + j;
}
}
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
for(k = 0; k < 50000; k++)
sum += array1[i][k]*array2[k][j];
array3[i][j] = sum;
sum = 0;
}
}
return 0;
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long hi, lo;
__asm__ __volatile__ ("xorl %%eax,%%eax\ncpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((unsigned long long)lo) | (((unsigned long long)hi)<<32) ;
}
static int my_init(void)
{
unsigned long str, end, curr, best, tsc, best_curr;
long i, t;
#define time_count(codes) for(i=0; i<120000; i++){str=rdtsc(); codes; end=rdtsc(); curr=end-str; if(curr<best)best=curr;}
best = ~0;
time_count();
tsc = best;
best = ~0;
time_count(matrix_product());
best_curr = best;
printk("<0>matrix product: %lu ticks\n", best_curr-tsc);
return 0;
}
static void my_exit(void){
return;
}
module_init(my_init);
module_exit(my_exit);
Any help is appreciated! Thanks.
rdtsc is not guaranteed to be available on every CPU, to run at a constant rate, or to be consistent between different cores.
Unless you have special requirements for the timestamps, you should use a reliable and portable function such as getrawmonotonic.
If you really want to use cycles directly, the kernel already defines the get_cycles and cpuid functions for this.
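As a rough user-space counterpart of this advice (a sketch, not kernel code): clock_gettime with CLOCK_MONOTONIC_RAW reads the same kind of raw monotonic clock that getrawmonotonic exposes inside the kernel. The helper names now_ns, best_elapsed_ns and busy_work are made up for illustration, and the best-of-N loop mirrors the time_count macro from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Prefer the raw monotonic clock where the platform offers it (e.g. Linux). */
#ifdef CLOCK_MONOTONIC_RAW
#define BENCH_CLOCK CLOCK_MONOTONIC_RAW
#else
#define BENCH_CLOCK CLOCK_MONOTONIC
#endif

/* Current reading of the benchmark clock, in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(BENCH_CLOCK, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical workload; volatile keeps the loop from being optimized away. */
void busy_work(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sink += i;
}

/* Run fn `repeats` times and return the shortest elapsed time in ns.
 * Returns UINT64_MAX if repeats is 0 (nothing was measured). */
uint64_t best_elapsed_ns(void (*fn)(void), int repeats)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < repeats; r++) {
        uint64_t start = now_ns();
        fn();
        uint64_t elapsed = now_ns() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling best_elapsed_ns(busy_work, 1000) gives a best-case figure in nanoseconds rather than cycles, which sidesteps the rdtsc caveats listed above.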
I have a question about multithreading.
This code is a simple example comparing a single thread with multiple threads
(summing 0 to 400,000,000 with a single thread vs. 4 threads).
//Single
#include<pthread.h>
#include<unistd.h>
#include<stdio.h>
#include<stdlib.h>
#define NUM_THREAD 4
#define MY_NUM 100000000
void* calcThread(void* param);
double total = 0;
double sum[NUM_THREAD] = { 0, };
int main() {
long p[NUM_THREAD] = {MY_NUM, MY_NUM * 2,MY_NUM * 3,MY_NUM * 4 };
int i;
long total_nstime;
struct timespec begin, end;
pthread_t tid[NUM_THREAD];
pthread_attr_t attr[NUM_THREAD];
clock_gettime(CLOCK_MONOTONIC, &begin);
for (i = 0; i < NUM_THREAD; i++) {
calcThread((void*)p[i]);
}
for (i = 0; i < NUM_THREAD; i++) {
total += sum[i];
}
clock_gettime(CLOCK_MONOTONIC, &end);
printf("total = %lf\n", total);
total_nstime = (end.tv_sec - begin.tv_sec) * 1000000000 + (end.tv_nsec - begin.tv_nsec);
printf("%.3fs\n", (float)total_nstime / 1000000000);
return 0;
}
void* calcThread(void* param) {
int i;
long to = (long)(param);
int from = to - MY_NUM + 1;
int th_num = from / MY_NUM;
for (i = from; i <= to; i++)
sum[th_num] += i;
}
I wanted to switch to a 4-thread version, so I changed the calculation to run in multiple threads.
...
int main() {
...
//createThread
for (i = 0; i < NUM_THREAD; i++) {
pthread_attr_init(&attr[i]);
pthread_create(&tid[i],&attr[i],calcThread,(void *)p[i]);
}
//wait
for(i=0;i<NUM_THREAD;i++){
pthread_join(tid[i],NULL);
}
for (i = 0; i < NUM_THREAD; i++) {
total += sum[i];
}
clock_gettime(CLOCK_MONOTONIC, &end);
...
}
Result(in Ubuntu)
But it's slower than the single-function code. I know MultiThread is faster.
I have no idea what causes this problem :( What's wrong?
Could you give me some advice? Thanks a lot!
"I know MultiThread is faster"
This isn't always the case. You are generally CPU bound in some way, whether by the core count, by how threads are scheduled at the OS level, or by the hardware itself.
Deciding how many threads are worth giving to a process is a balancing act: you may run into an old Linux problem where more time is spent scheduling the processes than actually running them.
As this is very hardware and OS dependent, it is difficult to say exactly what the issue may be. Make sure you have the appropriate microcode for your CPU installed (it is generally installed by default in Ubuntu); just in case, try:
sudo apt-get install intel-microcode
Otherwise, look at what other processes are running; a lot of other work may be running on the cores that are allocated to your process.
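One more thing that can be checked in code shaped like the question's, alongside the scheduling and hardware factors above: all four threads update neighbouring elements of the global sum array, and such neighbours commonly share a cache line (an effect usually called false sharing), so the cores may keep invalidating each other's cached copy. Whether that is what slows this particular run is an assumption. The sketch below simply accumulates into a local variable and writes the result once per thread; range_t, sum_range and threaded_sum are illustrative names, not part of the original code.

```c
#include <pthread.h>

#define NUM_THREAD 4

/* Work description for one thread: an inclusive range and its result. */
typedef struct {
    long from, to;
    double result;
} range_t;

/* Sum the integers in [from, to] into a local, then store it once. */
void *sum_range(void *arg)
{
    range_t *r = arg;
    double local = 0.0;
    for (long i = r->from; i <= r->to; i++)
        local += (double)i;
    r->result = local;
    return NULL;
}

/* Sum 1 .. NUM_THREAD * per_thread using NUM_THREAD threads. */
double threaded_sum(long per_thread)
{
    pthread_t tid[NUM_THREAD];
    int started[NUM_THREAD];
    range_t ranges[NUM_THREAD];
    double total = 0.0;

    for (int t = 0; t < NUM_THREAD; t++) {
        ranges[t].from = (long)t * per_thread + 1;
        ranges[t].to = (long)(t + 1) * per_thread;
        ranges[t].result = 0.0;
        started[t] = (pthread_create(&tid[t], NULL, sum_range, &ranges[t]) == 0);
        if (!started[t])
            sum_range(&ranges[t]);  /* fall back to computing in this thread */
    }
    for (int t = 0; t < NUM_THREAD; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += ranges[t].result;
    }
    return total;
}
```

Built with something like gcc -O2 -pthread, threaded_sum(100000000) covers the same range as the question and can be timed with the same clock_gettime calls for comparison.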
I am trying to parallelize the RSA algorithm using the repeated square-and-multiply method in OpenMP.
The code is as follows:
long long unsigned int mod_exp(int base,int exp,int n)
{
long long unsigned int i,pow1=1,pow2=1,pow3=1,pow4=1,pow=1,pow5=1;
int exp1=exp/4;
int id;
for(i=0;i<exp1;i++)
pow1=(pow1*base)%n;
for(i=0;i<exp1;i++)
pow2=(pow2*base)%n;
for(i=0;i<exp1;i++)
pow3=(pow3*base)%n;
for(i=0;i<exp1;i++)
pow4=(pow4*base)%n;
for(i=0;i<1;i++)
pow5=(pow5*base)%n;
pow=pow1*pow2*pow3*pow4*pow5;
pow=pow%n;
return pow;
}
With just #pragma omp for I am unable to get the correct output.
Kindly help.
I guess you could go for something like this:
long long unsigned int i, pow = 1, exp1 = exp / 4;
int k;
#pragma omp parallel for reduction( * : pow )
for ( k = 0; k < 5; k++ ) {
for ( i = 0; i < exp1 ; i++ ) {
pow = ( pow * base ) % n;
}
}
That should work, but I doubt it would do you much good: the amount of work is so limited that the parallelisation overhead is very likely to slow the code down instead of speeding it up.
EDIT: Hmm, actually, I can't make sense of the initial code... Why are we doing the same computation 5 times? Did I miss something?
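For comparison, here is a minimal serial sketch of the repeated square-and-multiply method that the question names but does not actually implement. The function name mod_exp_sqm and its signature are assumptions for illustration; restricting the modulus to 32 bits is a deliberate simplification so that every intermediate product fits in 64 bits.

```c
#include <stdint.h>

/* Compute (base^exp) mod n by repeated squaring.
 * Assumes n > 0; n is limited to 32 bits so that products of two
 * reduced values never overflow a 64-bit integer. */
uint64_t mod_exp_sqm(uint64_t base, uint64_t exp, uint32_t n)
{
    uint64_t result = 1 % n;   /* handles n == 1, where everything is 0 */
    base %= n;
    while (exp > 0) {
        if (exp & 1)                    /* current exponent bit is set */
            result = (result * base) % n;
        base = (base * base) % n;       /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```

The loop runs once per bit of the exponent, and each step depends on the previous one, so there is little here for a parallel for to split up; that fits the doubt expressed above about parallelising this computation.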
I thought memory access would be faster than the multiplication and division (although compiler-optimized) done with alpha blending. But it wasn't as fast as expected.
The 16 megabytes used for the table is not an issue in this case. But it is a problem if table lookup could even be slower than doing all the CPU calculations.
Can anyone explain to me why and what is happening? Will the table lookup beat out with a slower CPU?
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
#define COLOR_MAX UCHAR_MAX
typedef unsigned char color;
color (*blending_table)[COLOR_MAX + 1][COLOR_MAX + 1];
static color blend(unsigned int destination, unsigned int source, unsigned int a) {
return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}
void initialize_blending_table(void) {
int destination, source, a;
blending_table = malloc((COLOR_MAX + 1) * sizeof *blending_table);
for (destination = 0; destination <= COLOR_MAX; ++destination) {
for (source = 0; source <= COLOR_MAX; ++source) {
for (a = 0; a <= COLOR_MAX; ++a) {
blending_table[destination][source][a] = blend(destination, source, a);
}
}
}
}
struct timer {
double start;
double end;
};
void timer_start(struct timer *self) {
self->start = clock();
}
void timer_end(struct timer *self) {
self->end = clock();
}
double timer_measure_in_seconds(struct timer *self) {
return (self->end - self->start) / CLOCKS_PER_SEC;
}
#define n 300
int main(void) {
struct timer timer;
volatile int i, j, k, l, m;
timer_start(&timer);
initialize_blending_table();
timer_end(&timer);
printf("init %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blending_table[j][k][l];
}
}
}
}
timer_end(&timer);
printf("table %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blend(j, k, l);
}
}
}
}
timer_end(&timer);
printf("function %f\n", timer_measure_in_seconds(&timer));
return EXIT_SUCCESS;
}
result
$ gcc test.c -O3
$ ./a.out
init 0.034328
table 14.176643
function 14.183924
Table lookup is not a panacea. It helps when the table is small enough, but in your case the table is very big. You write
16 megabytes used for the table is not an issue in this case
which I think is very wrong, and is possibly the source of the problem you experience. 16 megabytes is too big for L1 cache, so reading data from random indices in the table will involve the slower caches (L2, L3, etc). The penalty for cache misses is typically large; your blending algorithm must be very complex if you want your LUT solution to be faster.
Read the Wikipedia article for more info.
Your benchmark is hopelessly broken: it makes the LUT look a lot better than it actually is, because it reads the table in order.
If your performance results show that the LUT is worse than direct calculation, then when you start with real-world random access patterns and cache misses, the LUT is going to be much worse.
Focus on improving the computation, and enabling vectorization. It's likely to pay off far better than a table-based approach.
(source * a + destination * (COLOR_MAX - a)) / COLOR_MAX
with rearrangement becomes
(source * a + destination * COLOR_MAX - destination * a) / COLOR_MAX
which simplifies to
destination + (source - destination) * a / COLOR_MAX
which has one multiply and one division by a constant, both of which are very efficient. And it is easily vectorized.
You should also mark your helper function as inline, although a good optimizing compiler is probably inlining it anyway.
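A small sketch to check the rearranged formula against the original one. Using signed int in the simplified version is an assumption made here so that (source - destination) can go negative; with the unsigned parameters from the question, that subtraction would wrap around. The names blend_original, blend_simplified and max_blend_difference are illustrative.

```c
#include <stdlib.h>

#define COLOR_MAX 255

/* Original form from the question: two multiplies, one division. */
unsigned blend_original(unsigned destination, unsigned source, unsigned a)
{
    return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}

/* Rearranged form: one multiply and one division by a constant.
 * Signed arithmetic, because source - destination may be negative. */
int blend_simplified(int destination, int source, int a)
{
    return destination + (source - destination) * a / COLOR_MAX;
}

/* Largest absolute difference between the two forms over all 8-bit inputs. */
int max_blend_difference(void)
{
    int worst = 0;
    for (int d = 0; d <= COLOR_MAX; d++) {
        for (int s = 0; s <= COLOR_MAX; s++) {
            for (int a = 0; a <= COLOR_MAX; a++) {
                int diff = abs(blend_simplified(d, s, a)
                               - (int)blend_original(d, s, a));
                if (diff > worst)
                    worst = diff;
            }
        }
    }
    return worst;
}
```

In integer arithmetic the two forms agree whenever source >= destination. When source < destination they can differ by one, because C division truncates toward zero rather than rounding down, so the rearrangement is exact algebraically but only approximately equal as integer code.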
I have used the Sieve of Eratosthenes algorithm to find the sum of the prime numbers below a certain limit. It worked correctly up to a limit of 2 million, but when I tried 3 million the program stopped while executing. Here is the code:
int main(){
bool x[3000000];
unsigned long long sum = 0;
for(unsigned long long i=0; i< 3000000; i++)
x[i] = true;
x[0] = x[1] = false;
for(unsigned long long i = 2; i < 3000000; i++){
if(x[i]){
for (unsigned long long j = 2; i * j < 3000000; j++) {
x[j*i] = false;
}
sum += i;
}
}
printf("%ld", sum);
return 0;
}
Most likely bool x[3000000]; will result in a stack overflow, as it requires more memory than is typically available on the stack. As a quick fix, change this to:
static bool x[3000000];
or consider using dynamic memory allocation:
bool *x = malloc(3000000 * sizeof(*x));
// do your stuff
free(x);
Also note that your printf format specifier is wrong, since sum is declared as unsigned long long - change:
printf("%ld", sum);
to:
printf("%llu", sum);
If you're using a decent compiler (e.g. gcc) with warnings enabled (e.g. gcc -Wall ...) then the compiler should already have warned you about this error.
One further tip: don't use hard-coded constants like 3000000. Use a symbolic constant so that you only have to define the value in one place; this is known as the "Single Point Of Truth" (SPOT) principle, or "Don't Repeat Yourself" (DRY):
const size_t n = 3000000;
and then use n wherever you currently use 3000000. (In C, a const variable is not a constant expression, so if n is used to size a static array, define it with #define or an enum instead.)
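Putting these suggestions together, a minimal sketch might look like the following. The function name sum_primes_below and the choice to pass the limit as a parameter are assumptions for illustration; the flags are allocated on the heap so the array size no longer depends on the stack limit.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sum of all primes strictly below `limit`.
 * Returns 0 if there are none or if the allocation fails. */
unsigned long long sum_primes_below(size_t limit)
{
    if (limit < 3)
        return 0;                      /* no primes below 2 */

    /* calloc zero-fills, so every entry starts as "not composite". */
    bool *composite = calloc(limit, sizeof *composite);
    if (composite == NULL)
        return 0;

    unsigned long long sum = 0;
    for (size_t i = 2; i < limit; i++) {
        if (composite[i])
            continue;
        sum += i;                      /* i is prime */
        for (size_t j = 2 * i; j < limit; j += i)
            composite[j] = true;       /* cross off multiples of i */
    }

    free(composite);
    return sum;
}
```

The result can then be printed with the matching format specifier, for example printf("%llu\n", sum_primes_below(3000000));.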
I'm trying to implement a simplified version of Lamport's Bakery Algorithm in C before I attempt to use it to solve a more complex problem.* The simplification I am making is that the lock is shared by only two threads instead of N.
I set up two threads (via OpenMP to keep things simple) and they loop, attempting to increment a shared counter within their critical section. If everything goes according to plan, then the final counter value should be equal to the number of iterations. However, here's some example output:
count: 9371470 (expected: 10000000)
Doh! Something is broken, but what? My implementation is pretty textbook (for reference), so perhaps I'm misusing memory barriers? Did I forget to mark something as volatile?
My code:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <omp.h>
typedef struct
{
volatile bool entering[2];
volatile uint32_t number[2];
} SimpleBakeryLock_t;
inline void mb() { __sync_synchronize(); }
inline void lock(SimpleBakeryLock_t* l, int id)
{
int i = id, j = !id;
uint32_t ni, nj;
l->entering[i] = true;
mb();
ni = 1 + l->number[j];
l->number[i] = ni;
mb();
l->entering[i] = false;
mb();
while (l->entering[j]) {
mb();
}
nj = l->number[j];
mb();
while ((nj != 0) && (nj < ni || (nj == ni && j < i)))
{
nj = l->number[j]; // re-read
mb();
}
}
inline void unlock(SimpleBakeryLock_t* l, int id)
{
l->number[id] = 0;
mb();
}
SimpleBakeryLock_t x;
int main(void)
{
const uint32_t iterations = 10000000;
uint32_t count = 0;
bool once = false;
int i;
memset((void*)&x, 0, sizeof(x));
mb();
// set OMP_NUM_THREADS=2 in your environment!
#pragma omp parallel for schedule(static, 1) private(once, i)
for(uint32_t dummy = 0; dummy < iterations; ++dummy)
{
if (!once)
{
i = omp_get_thread_num();
once = true;
}
lock(&x, i);
{
count = count + 1;
mb();
}
unlock(&x, i);
}
printf("count: %u (expected: %u)\n", count, iterations);
return 0;
}
To compile and run (on Linux), do:
$ gcc -O3 -fopenmp bakery.c
$ export OMP_NUM_THREADS=2
$ ./a.out
I intend to chain simple Bakery locks into a binary tree (tournament style) to achieve mutual exclusion among N threads.
I tracked down two problems and the code now works. Issues:
__sync_synchronize() was not generating the mfence instruction on my platform (Apple's GCC 4.2.1). Replacing __sync_synchronize() with an explicit mfence resolves this issue.
I was doing something wrong with the OpenMP private variables (still not sure what...). Sometimes the two threads entered the lock with the same identity (e.g. both might say they were thread 0). Recomputing 'i' with 'omp_get_thread_num' on every iteration seems to do the trick.
Here is the corrected code for completeness:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <omp.h>
#define cpu_relax() asm volatile ("pause":::"memory")
#define mb() asm volatile ("mfence":::"memory")
/* Simple Lamport bakery lock for two threads. */
typedef struct
{
volatile uint32_t entering[2];
volatile uint32_t number[2];
} SimpleBakeryLock_t;
void lock(SimpleBakeryLock_t* l, int id)
{
int i = id, j = !id;
uint32_t ni, nj;
l->entering[i] = 1;
mb();
ni = 1 + l->number[j];
l->number[i] = ni;
mb();
l->entering[i] = 0;
mb();
while (l->entering[j]) {
cpu_relax();
}
do {
nj = l->number[j];
} while ((nj != 0) && (nj < ni || (nj == ni && j < i)));
}
void unlock(SimpleBakeryLock_t* l, int id)
{
mb(); /* prevent critical section writes from leaking out over unlock */
l->number[id] = 0;
mb();
}
SimpleBakeryLock_t x;
int main(void)
{
const int32_t iterations = 10000000;
int32_t dummy;
uint32_t count = 0;
memset((void*)&x, 0, sizeof(x));
mb();
// set OMP_NUM_THREADS=2 in your environment!
#pragma omp parallel for schedule(static, 1)
for(dummy = 0; dummy < iterations; ++dummy)
{
int i = omp_get_thread_num();
lock(&x, i);
count = count + 1;
unlock(&x, i);
}
printf("count: %u (expected: %u)\n", count, iterations);
return 0;
}
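As a possible alternative to the hand-written mfence, here is a sketch of the same two-thread lock using C11 <stdatomic.h>, whose default atomic_load and atomic_store are sequentially consistent. This swaps in a different technique from the code above (C11 atomics and pthreads instead of inline assembly and OpenMP), and the names bakery2_t, bakery2_lock and run_bakery_demo are made up for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Two-thread bakery lock built on sequentially consistent atomics. */
typedef struct {
    atomic_uint entering[2];
    atomic_uint number[2];
} bakery2_t;

void bakery2_init(bakery2_t *l)
{
    for (int k = 0; k < 2; k++) {
        atomic_init(&l->entering[k], 0);
        atomic_init(&l->number[k], 0);
    }
}

void bakery2_lock(bakery2_t *l, int id)
{
    int i = id, j = !id;
    atomic_store(&l->entering[i], 1);
    unsigned ni = 1 + atomic_load(&l->number[j]);   /* take a ticket */
    atomic_store(&l->number[i], ni);
    atomic_store(&l->entering[i], 0);
    while (atomic_load(&l->entering[j]))
        ;                                           /* other thread is choosing */
    for (;;) {
        unsigned nj = atomic_load(&l->number[j]);
        /* Enter when the other thread holds no ticket or a later one. */
        if (nj == 0 || nj > ni || (nj == ni && i < j))
            break;
    }
}

void bakery2_unlock(bakery2_t *l, int id)
{
    atomic_store(&l->number[id], 0);
}

static bakery2_t demo_lock;
static unsigned long demo_count;        /* protected by demo_lock */
static unsigned long demo_iterations;

static void *demo_worker(void *arg)
{
    int id = *(int *)arg;
    for (unsigned long n = 0; n < demo_iterations; n++) {
        bakery2_lock(&demo_lock, id);
        demo_count++;
        bakery2_unlock(&demo_lock, id);
    }
    return NULL;
}

/* Run two threads that each increment the counter `iterations` times. */
unsigned long run_bakery_demo(unsigned long iterations)
{
    static int ids[2] = { 0, 1 };
    pthread_t tid[2];
    bakery2_init(&demo_lock);
    demo_count = 0;
    demo_iterations = iterations;
    pthread_create(&tid[0], NULL, demo_worker, &ids[0]);
    pthread_create(&tid[1], NULL, demo_worker, &ids[1]);
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);
    return demo_count;
}
```

With a C11 compiler and -pthread, run_bakery_demo(5000000) corresponds to the 10000000 total increments of the test above. Each thread is handed its identity explicitly here, which also avoids relying on a per-thread private variable.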