A colleague of mine asked me to write a homework assignment for him. Although this wasn't too ethical, I did it; I plead guilty.
This is how the problem goes:
Write a program in C that calculates the sum 1^2 + 2^2 + ... + n^2.
Assume that n is a multiple of p, and that p is the number of threads.
This is what I wrote:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SQR(X) ((X) * (X))

int n, p = 10, total_sum = 0;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Function prototype */
void *do_calc(void *arg);

int main(int argc, char **argv)
{
    int i;
    pthread_t *thread_array;

    printf("Type number n: ");
    fscanf(stdin, "%d", &n);
    if (n % p != 0) {
        fprintf(stderr, "Number must be multiple of 10 (number of threads)\n");
        exit(-1);
    }

    thread_array = (pthread_t *) malloc(p * sizeof(pthread_t));

    for (i = 0; i < p; i++)
        pthread_create(&thread_array[i], NULL, do_calc, (void *) i);
    for (i = 0; i < p; i++)
        pthread_join(thread_array[i], NULL);

    printf("Total sum: %d\n", total_sum);
    pthread_exit(NULL);
}

void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;

    pthread_mutex_lock(&mtx);
    for (i = thr * (n / p); i < ((thr + 1) * (n / p)); i++)
        local_sum += SQR(i + 1);
    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);

    pthread_exit(NULL);
}
Aside from the logical/syntactic point of view, I was wondering:
how the respective non-multithreaded program would perform
how could I test/see their performance
what would be the program without using threads
Thanks in advance and I’m looking forward to reading your thoughts
You are acquiring the mutex before the calculations. You should do that immediately before adding your local sum to the shared total:
pthread_mutex_lock(&mtx);
total_sum += local_sum;
pthread_mutex_unlock(&mtx);
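A corrected do_calc() would then look like this (a sketch based on the question's code; the loop runs unlocked, and the mutex guards only the update of the shared total):

void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;

    /* Each thread works on its own local_sum here, fully in parallel. */
    for (i = thr * (n / p); i < (thr + 1) * (n / p); i++)
        local_sum += SQR(i + 1);

    /* The lock is held only for the single shared addition. */
    pthread_mutex_lock(&mtx);
    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);

    pthread_exit(NULL);
}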
This would depend on how many CPUs you have. With a single CPU core, a computation-bound program will never run faster with multiple threads.
Moreover, since you're doing all the work with the lock held, you'll end up with only a single thread running at any time, so it's effectively single threaded anyway.
Don't bother with threading etc. In fact, don't do any additions in a loop at all. Just use this formula:
∑(r = 1 to n) r² = n(n + 1)(2n + 1) / 6 [1]
[1]http://thesaurus.maths.org/mmkb/entry.html?action=entryById&id=1539
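In C this is a one-liner. A minimal sketch (using unsigned long long so the product does not overflow for moderately large n):

#include <stdio.h>

/* Sum of squares 1^2 + 2^2 + ... + n^2 by the closed-form formula.
   n(n+1)(2n+1) is always divisible by 6, so the division is exact. */
unsigned long long sum_of_squares(unsigned long long n)
{
    return n * (n + 1) * (2 * n + 1) / 6;
}

int main(void)
{
    printf("%llu\n", sum_of_squares(10)); /* prints 385 */
    return 0;
}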
As your code is serialised by a mutex in the actual calculation, it will be slower than a non-threaded version. Of course, you could easily have tested this for yourself.
I would try to see how much time those calculations actually take. If it's a very small fraction, I would probably go for a single-threaded model, since spawning a thread for each calculation involves some overhead of its own.
To compare performance, just record the system time at program start, run with n = 1000, and read the system time again at the end. Compare to the non-threaded program's result.
As bdonlan said, the non-threaded version will run faster.
1) A single-threaded version would probably perform a bit better than this, because all calculations are done within a lock and the overhead of locking adds to the total time. You are better off locking only when adding the local sums to the total sum, or storing the local sums in an array and calculating the total sum in the main thread.
2) Use timing statements in your code to measure elapsed time during the algorithm (see the timing sketch after the code below). In the multithreaded case, only measure elapsed time on the main thread.
3) Derived from your code:

int i, total_sum = 0;

for (i = 0; i < n; i++)
    total_sum += SQR(i + 1);
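For point 2, a minimal timing sketch using POSIX clock_gettime() (CLOCK_MONOTONIC assumed available; wrap only the section you want to measure):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    long long i, n = 1000000, total_sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < n; i++)
        total_sum += (i + 1) * (i + 1);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("sum = %lld, elapsed = %.6f s\n", total_sum, elapsed);
    return 0;
}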
A much larger consideration is scheduling. The easiest way for kernel-side threading to be implemented is for each thread to get equal time regardless. Processes are just threads with their own memory space. If all threads get equal time, adding a thread takes you from 1/n of the time to 2/(n + 1) of the time, which is obviously better given > 0 other threads that aren't yours.
Actual implementations may and do vary wildly, though.
Off-topic a bit, but you could avoid the mutex by having each thread write its result into an array element: allocate results = calloc(p, sizeof(int)) (by the way, "p" is an awful name for the variable holding the number of threads) and have each thread do results[thr] = local_sum, then let the joining thread (well, main()) do the summing of the results. Each thread is responsible only for calculating its own total; main(), which orchestrates the threads, joins their data together. Separation of concerns.
For extra credit (:p), use the arg passed to do_calc() as a way to pass both the thread ID and the location to write the result to, rather than relying on a global array.
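A sketch of that approach, combining both suggestions (the thread_info struct and its field names are hypothetical, not from the question):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SQR(X) ((X) * (X))

int n, p = 10;   /* assumes n is a multiple of p, as in the original */

/* Each thread gets its ID and a private slot for its result,
   so no mutex is needed at all. */
struct thread_info {
    int id;
    int *result;   /* points into the results array owned by main() */
};

void *do_calc(void *arg)
{
    struct thread_info *ti = arg;
    int i, local_sum = 0;

    for (i = ti->id * (n / p); i < (ti->id + 1) * (n / p); i++)
        local_sum += SQR(i + 1);

    *ti->result = local_sum;
    return NULL;
}

int main(void)
{
    int i, total_sum = 0;
    pthread_t threads[10];
    struct thread_info info[10];
    int *results = calloc(p, sizeof(int));

    printf("Type number n: ");
    scanf("%d", &n);

    for (i = 0; i < p; i++) {
        info[i].id = i;
        info[i].result = &results[i];
        pthread_create(&threads[i], NULL, do_calc, &info[i]);
    }
    for (i = 0; i < p; i++)
        pthread_join(threads[i], NULL);

    /* Only main() touches the total: it joins the threads' data together. */
    for (i = 0; i < p; i++)
        total_sum += results[i];

    printf("Total sum: %d\n", total_sum);
    free(results);
    return 0;
}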
I am using this program, written in C, to determine the permutations of size 10 of a regular alphabet.
When I run the program it only uses 36% of my 3GHz CPU, leaving 50% free. It also only uses 7MB of my 8GB of RAM.
I would like to use at least 70-80% of my computer's performance, not just this misery. This limitation is making the procedure very time consuming, and I don't know how many days I will need to wait for the complete output. I need help resolving this issue in the shortest possible time, whether by improving the source code or by other means.
Any help is welcome, even if the solution means using a language other than C that gives me better performance in running the program.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static int count = 0;

void print_permutations(char arr[], char prefix[], int n, int k) {
    int i, j, l = strlen(prefix);
    char newprefix[l + 2];

    if (k == 0) {
        printf("%d %s\n", ++count, prefix);
        return;
    }

    for (i = 0; i < n; i++) {
        //Concatenation of currentPrefix + arr[i] = newPrefix
        for (j = 0; j < l; j++)
            newprefix[j] = prefix[j];
        newprefix[l] = arr[i];
        newprefix[l + 1] = '\0';
        print_permutations(arr, newprefix, n, k - 1);
    }
}

int main() {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";

    print_permutations(arr, "", n, k);
    system("pause");
    return 0;
}
There are fundamental problems with your approach:
What are you trying to achieve?
If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed as it enumerates all combinations of 10 letters from the alphabet. It will produce 26^10 combinations, a huge number: 141167095653376, more than 141 trillion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 petabytes, unlikely to fit on your storage space, and writing it at a top speed of 100MB/s would take more than 170 days.
The number of permutations, that is combinations of distinct letters from the 26-letter alphabet, is not quite as large: 26! / 16!, which is still huge: 19275223968000, 7 times less than the previous result. That is still more than 212 terabytes of storage and more than 24 days at 100MB/s.
Storing these permutations is therefore impractical. You could change your program to just count the permutations, verify that the count matches the expected value, and measure how long that takes. The first step, of course, is to correct your program to produce the correct set.
Test on smaller sets to verify correctness
Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.
Once you have correctness, only then focus on performance
Selecting different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent in producing the output. Once you stop producing output, you might see that strlen() consumes a significant fraction of the execution time, which is useless since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, precluding the need to copy at each recursive step.
Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.
Memory use is no measure of performance
Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. 7MB is actually much more than your program's actual needs: this count includes the full stack space assigned to the program, of which only a tiny fraction will be used.
As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.
Here is a modified program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long count;

void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
    if (pos == k) {
        prefix[k] = '\0';
        ++count;
        //printf("%llu %s\n", count, prefix);
        //puts(prefix);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            prefix[pos] = arr[i];
            print_permutations(arr, n, used, prefix, pos + 1, k);
            used[i] = 0;
        }
    }
}

int main(int argc, char *argv[]) {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    char used[27] = { 0 };
    char perm[27];
    unsigned long long expected_count;
    clock_t start, elapsed;

    if (argc >= 2)
        k = strtol(argv[1], NULL, 0);
    if (argc >= 3)
        n = strtol(argv[2], NULL, 0);

    start = clock();
    print_permutations(arr, n, used, perm, 0, k);
    elapsed = clock() - start;

    expected_count = 1;
    for (int i = n; i > n - k; i--)
        expected_count *= i;

    printf("%llu permutations, expected %llu, %.0f permutations per second\n",
           count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
    return 0;
}
Without output, this program enumerates 140 million combinations per second on my slow laptop, it would take 1.5 days to enumerate the 19275223968000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual core hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global variable count.
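For illustration, a counting-only OpenMP sketch along those lines (my restructuring, not the answer's code; compile with -fopenmp; the recursive helper returns its count instead of updating a global, and the parallel loop splits the work by first letter):

#include <stdio.h>

/* Counts k-permutations recursively, returning the count instead of
   touching shared state. */
static unsigned long long count_perms(int n, char *used, int pos, int k) {
    if (pos == k)
        return 1;
    unsigned long long c = 0;
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            c += count_perms(n, used, pos + 1, k);
            used[i] = 0;
        }
    }
    return c;
}

int main(void) {
    int n = 26, k = 10;
    unsigned long long total = 0;

    /* One top-level branch per initial letter; each thread keeps its own
       partial count and OpenMP sums them at the end. */
    #pragma omp parallel for reduction(+: total) schedule(dynamic)
    for (int i = 0; i < n; i++) {
        char used[26] = { 0 };
        used[i] = 1;
        total += count_perms(n, used, 1, k);
    }

    printf("%llu permutations of length %d\n", total, k);
    return 0;
}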
There are multiple reasons for your bad experience:
Your metric:
Your metric is fundamentally flawed. Peak CPU% is an imprecise measurement of "how much work does my CPU do", which normally isn't what you're most interested in anyway. You can inflate this number by doing more work (like starting another thread that doesn't contribute to the output at all).
Your proper metric would be items per second: How many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4), and measure how long it takes.
Your problem: Your problem is hard. Printing or writing down all 26^10 ≈ 1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be 1.4 petabytes - which is most likely more than your hard drive will accept. Also, even if you used your CPU at 100% and spent one thousand cycles per word, it'd take 1.5 years. 1000 cycles per word is optimistic; you most likely won't be faster than this while still printing your result, as printf alone usually takes around 1000 cycles to complete.
Your output: Writing to stdout is slow compared to writing to a file; see https://stackoverflow.com/a/14574238/4838547.
Your program: There are issues with your program that could be a problem for your performance. However, they are dominated by the other problems stated here. With my setup, this program uses 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
I am a newbie to programming with OpenMP. I wrote a simple C program to multiply a matrix with a vector. Unfortunately, by comparing execution times I found that OpenMP is much slower than the sequential way.
Here is my code (here the matrix is N*N int, the vector is N int, the result is N long long):
#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for (i = 0; i < m_size; i++)
{
    for (j = 0; j < m_size; j++)
    {
        result[i] += matrix[i][j] * vector[j];
    }
}
And this is the code for sequential way:
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];
When I tried these two implementations with a 999x999 matrix and a 999 vector, the execution time is:
Sequential: 5439 ms
Parallel: 11120 ms
I really cannot understand why OpenMP is so much slower than the sequential algorithm (over 2 times slower!). Can anyone solve my problem?
Your code partially suffers from the so-called false sharing, typical for all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores and they have to refresh their copy from the upper level cache or from the main memory. As result is an array of long long, then one cache line (64 bytes on x86) holds 8 elements and besides result[i] there are 7 other array elements in the same cache line. Therefore it is possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in the cache line. For example, you can apply schedule(static,something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads, you would apply the schedule(static,256) clause to the parallel for construct.
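Applied to the loop in question, that would look like this (a sketch; only the schedule clause is added to the original pragma):

#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size) schedule(static,256)
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];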
Another partial reason for the code running slower might be that, when OpenMP is enabled, the compiler might become reluctant to apply some code optimisations when shared variables are being assigned to. OpenMP provides for the so-called relaxed memory model, where the local memory view of a shared variable in each thread is allowed to differ, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and the value of result[i] is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result of the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporary desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary variable trick with OpenMP both enabled and disabled.
Because when OpenMP distributes the work among threads there is a lot of administration/synchronisation going on to ensure the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only: humans see that easily, your compiler may not.
Things to try out for pedagogic reasons:
0) What happens if matrix and vector are not shared?
1) Parallelize the inner "j-loop" first, keep the outer "i-loop" serial. See what happens.
2) Do not collect the sum in result[i], but in a variable temp and assign its contents to result[i] only after the inner loop is finished to avoid repeated index lookups. Don't forget to init temp to 0 before the inner loop starts.
I did this in reference to Hristo's comment. I tried using schedule(static, 256). For me, changing the default chunk size does not help; maybe it even makes things worse. I printed out the thread number and its index, with and without setting the schedule, and it's clear that OpenMP already chooses the thread indices to be far from one another, so false sharing does not seem to be an issue. For me, this code already gives a good boost with OpenMP.
#include "stdio.h"
#include <omp.h>
void loop_parallel(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
#pragma omp parallel for schedule(static, 250)
//#pragma omp parallel for
for (int i=0;i<m_size;i++) {
//printf("%d %d\n", omp_get_thread_num(), i);
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
void loop(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
for (int i=0;i<m_size;i++) {
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
int main() {
const int m_size = 1000;
int *matrix = new int[m_size*m_size];
int *vector = new int[m_size];
long long*result = new long long[m_size];
double dtime;
dtime = omp_get_wtime();
loop(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
dtime = omp_get_wtime();
loop_parallel(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
}
Let's say I have a vector of n elements and n_threads available.
I want to use #pragma omp parallel such that each thread receives a chunk of n / n_threads iterations,
with the last thread additionally receiving the remainder, depending on the case.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
    int n = 10003;   /* from the example below */
    int i;
    int *v = malloc(n * sizeof(int));

    #pragma omp parallel for /* (what should I put here?) */
    for (i = 0; i < n; ++i)
    {
        ++v[i];
    }
    return 0;
}
Ex: n = 10003, n_threads = 4
thread_0 should get 2500 iterations
thread_1 should get 2500 iterations
thread_2 should get 2500 iterations
thread_3 should get 2503 iterations
In short - you can't do that. All you can do is to specify the schedule(static) clause without specifying the chunk size and the OpenMP runtime will divide the iterations count in approximately the same sized chunks. How exactly will it be done is up to the implementation. This is what the OpenMP standard says about static scheduling:
When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.
For n = 10003 and n_threads = 4, you could specify a chunk size of 2500 and the iteration space would be divided into chunks of size 2500, 2500, 2500, 2500 and 3, distributed to threads 0, 1, 2, 3 and 0. Thus thread 0 would get 2503 iterations, but they would not be contiguous in the iteration space. If you do not specify the chunk size, it is up to the implementation to decide which thread gets the extra iterations.
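In code, the first variant would be (a sketch of the question's loop with the explicit chunk size described above):

#pragma omp parallel for schedule(static, 2500) num_threads(4)
for (i = 0; i < n; i++)
{
    ++v[i];
}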
As far as I can tell, OpenMP doesn't guarantee exact chunk sizes, but it's not too hard to calculate them yourself. Here's some example code:
#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 10003;
    int n_threads = 4;
    int chunk_size = n / n_threads;

    #pragma omp parallel num_threads(n_threads)
    {
        int id = omp_get_thread_num();
        int b = id * chunk_size;
        int e = id == n_threads - 1 ? n : b + chunk_size;
        printf("thread %d: %d items\n", id, e - b);
        for (int i = b; i < e; i++) {
            // process item i
        }
    }
    return 0;
}
Sample output:
thread 0: 2500 items
thread 1: 2500 items
thread 3: 2503 items
thread 2: 2500 items
Beware: The strategy "each thread gets n / n_threads items, the last one more" is fine for the numbers you gave, but it may lead to very inefficient work sharing in other cases. For example, with 60 items and 16 threads, this formula would give all threads 3 items - except the last one, which would get 15 items. If processing each item takes roughly the same time, this would mean that the whole process takes about four times longer than necessary, and most CPU cores would be idle most of the time. I think you should only use this formula if there are good reasons why you need to distribute the work in exactly this way. Otherwise, the chunk sizes chosen by OpenMP are probably better.
After a lot of searching for an implementation of parallel quicksort in c, I'm about to dive in and code it myself. (I need to sort an array of about 1 million text strings.) It seems that all the implementations I have found divide the work inside the qsort function itself, which creates a huge amount of overhead in partitioning the relatively small amount of work per thread.
Would it not be much faster to divide the 1 million strings by the number of threads (in my case, 24 threads), and have them each work on a section, and then do a mergesort? Granted, this has the theoretical disadvantage that it is not an in-place sort, but with gobs of memory available it is not a problem. The machine this runs on has 12 (very fast) physical/24 logical cores and 192 GB (yes, gigabytes) of memory. Currently, even on this machine, the sort takes almost 8 minutes!
Would it not be much faster to divide the 1 million strings by the number of threads (in my case, 24 threads), and have them each work on a section, and then do a mergesort?
It's a good idea.
You can also make some useful observations by writing toy programs for quicksort and mergesort and taking advantage of their algorithmic/run-time behaviour.
For example, quicksort sorts during the dividing process (the pivot element is put in its final place at the end of each partitioning pass), while mergesort sorts during the merging (sorting happens only after the whole working set has been broken down into very granular units that can be compared directly with other granular units, via == or strcmp()).
Mixing algorithms based on the nature of the working set is a good idea.
With respect to the parallel sorting, here is my parallel merge-sort for you to get started.
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>

#define NOTHREADS 2

/*
gcc -ggdb -lpthread parallel-mergesort.c

NOTE:
The mergesort boils down to this:
given two sorted arrays, how do we merge them?
We need a new array to hold the result of merging,
otherwise it is not possible to do it using arrays alone,
so we may need a linked list.
*/

int a[] = {10, 8, 5, 2, 3, 6, 7, 1, 4, 9};

typedef struct node {
    int i;
    int j;
} NODE;

void merge(int i, int j)
{
    int mid = (i+j)/2;
    int ai = i;
    int bi = mid+1;
    int newa[j-i+1], newai = 0;

    while (ai <= mid && bi <= j) {
        if (a[ai] > a[bi])
            newa[newai++] = a[bi++];
        else
            newa[newai++] = a[ai++];
    }
    while (ai <= mid) {
        newa[newai++] = a[ai++];
    }
    while (bi <= j) {
        newa[newai++] = a[bi++];
    }
    for (ai = 0; ai < (j-i+1); ai++)
        a[i+ai] = newa[ai];
}

void *mergesort(void *a)
{
    NODE *p = (NODE *)a;
    NODE n1, n2;
    int mid = (p->i+p->j)/2;
    pthread_t tid1, tid2;
    int ret;

    n1.i = p->i;
    n1.j = mid;
    n2.i = mid+1;
    n2.j = p->j;

    if (p->i >= p->j)
        return NULL;

    ret = pthread_create(&tid1, NULL, mergesort, &n1);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }
    ret = pthread_create(&tid2, NULL, mergesort, &n2);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    merge(p->i, p->j);

    pthread_exit(NULL);
}

int main()
{
    int i;
    NODE m;
    pthread_t tid;
    int ret;

    m.i = 0;
    m.j = 9;

    ret = pthread_create(&tid, NULL, mergesort, &m);
    if (ret) {
        printf("%d %s - unable to create thread - ret - %d\n", __LINE__, __FUNCTION__, ret);
        exit(1);
    }

    pthread_join(tid, NULL);

    for (i = 0; i < 10; i++)
        printf("%d ", a[i]);
    printf("\n");

    // pthread_exit(NULL);
    return 0;
}
Good luck!
Quicksort involves an initial pass over a list, which sorts the list into sections that are higher and lower than the pivot.
Why not do that in one thread, and then spawn another thread and delegate it to one half while the extant thread takes the other half, and so on and so forth?
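A minimal sketch of that scheme for an int array (hypothetical helper names; a real version for the question would partition string pointers with strcmp() and cap the spawn depth near log2 of the core count):

#include <pthread.h>
#include <stdio.h>

typedef struct { int *a; int lo; int hi; int depth; } qs_arg;

static void qsort_parallel(int *a, int lo, int hi, int depth);

static void *qsort_thread(void *p)
{
    qs_arg *q = p;
    qsort_parallel(q->a, q->lo, q->hi, q->depth);
    return NULL;
}

/* Lomuto partition around a[hi]; returns the pivot's final index. */
static int partition(int *a, int lo, int hi)
{
    int pivot = a[hi], i = lo, t;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

static void qsort_parallel(int *a, int lo, int hi, int depth)
{
    if (lo >= hi)
        return;
    int p = partition(a, lo, hi);
    if (depth > 0) {
        /* Delegate the lower half to a new thread; keep the upper half
           in the current thread, then join before returning. */
        pthread_t tid;
        qs_arg arg = { a, lo, p - 1, depth - 1 };
        pthread_create(&tid, NULL, qsort_thread, &arg);
        qsort_parallel(a, p + 1, hi, depth - 1);
        pthread_join(tid, NULL);
    } else {
        qsort_parallel(a, lo, p - 1, 0);
        qsort_parallel(a, p + 1, hi, 0);
    }
}

int main(void)
{
    int a[] = { 9, 3, 7, 1, 8, 2, 6, 0, 5, 4 };
    qsort_parallel(a, 0, 9, 2); /* depth 2 => up to 4 concurrent sections */
    for (int i = 0; i < 10; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}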
Have you considered using a sorting algorithm specifically designed to sort strings?
It seems like it might be a better idea than trying to implement a custom quicksort. The specific choice of algorithms probably depends on the length of the strings and how different they are but a radix sort probably isn't a bad bet.
A quick google search turned up an article about sorting strings. I haven't read it but Sedgewick and Bentley really know their stuff. According to the abstract, their algorithm is an amalgam of Quicksort and radix sort.
Another possible solution is to wrap a parallel sorting algorithm from C++. GNU's STL implementation has a parallel mode, which contains a parallel quicksort implementation.
This is probably the easiest solution.
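Usage is essentially a drop-in replacement for std::sort (a sketch; assumes GCC's libstdc++ and compilation with -fopenmp):

#include <parallel/algorithm>  // GNU libstdc++ parallel mode
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> strings;
    // ... fill with the ~1 million strings ...

    // Explicitly request the parallel variant; the thread count
    // follows OMP_NUM_THREADS.
    __gnu_parallel::sort(strings.begin(), strings.end());
    return 0;
}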
To make multi-threaded quicksort feasible, memory accesses need to be optimized so that most of the sorting work is performed inside the non-shared caches (L1 and L2). My bet is that single-threaded quicksort will be faster than multi-threaded unless you're prepared to put in copious amounts of work.
One approach to test could be one thread to sort the upper half and another to sort the lower.
As to a special string-adapted sorting routine, the concept sounds strange to me. There aren't many cases where sorting a vector of only strings (or integers) is especially useful. Usually the data is organized in a table with columns and rows, and you want to sort the rows by one column containing letters; if two are equal, you sort by an additional column containing a time stamp, a ranking, or something else. So the sort routine should be able to handle a multi-level sort rule set that can specify any type of data (boolean, integer, dates, strings, floating point, etc.) in any direction (ascending or descending) present in the table's columns.
Given a C string (an array of characters terminated by a null character), we have to find the length of the string. Could you please suggest some ways to parallelize this for N threads of execution? I am having trouble dividing it into sub-problems, since accessing a location of the array that is past the end will give a segmentation fault.
EDIT: I am not concerned that doing this task in parallel may have much greater overhead. I just want to know if it can be done (using something like OpenMP, etc.)
No, it can't, because each step requires the previous state to be known (did we encounter a null on the previous character?). You can only safely check one character at a time.
Imagine you are turning over rocks and you MUST stop at one with white paint underneath (null) or you will die (aka seg fault etc).
You can't have people "working ahead" of each other, as the white paint rock might be in between.
Having multiple people (threads/processes) would simply be them taking turns being the one turning over the next rock. They would never be turning over rocks at the same time as each other.
It's probably not even worth trying. If the string is short, the overhead will be greater than the gain in processing speed. If the string is really long, the speed will probably be limited by memory bandwidth, not by CPU processing speed.
I'd say that with just a standard C string this cannot be done. However, if you can define a personal termination string with as many characters as there are processes, it's straightforward.
Do you know the maximum size of that char array? If so, you could do a parallel search in different chunks and return the index of the terminator with the smallest index.
Since you would then only be working on allocated memory, you cannot get segfaults.
Of course this is not as sophisticated as s_nair's answer, but pretty straightforward.
example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int N = 1000;
    char *str = calloc(N, sizeof(char));
    strcpy(str, "This is a test string!");
    fprintf(stdout, "%s\n", str);

    int nthreads = omp_get_num_procs();
    int i;
    int ind[nthreads];
    for (i = 0; i < nthreads; i++) {
        ind[i] = -1;
    }

    int procn;
    int flag;

    #pragma omp parallel private(procn, flag)
    {
        flag = 1;
        procn = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++) {
            if (str[i] == '\0' && flag == 1) {
                ind[procn] = i;
                flag = 0;
            }
        }
    }

    int len = 0;
    for (i = 0; i < nthreads; i++) {
        if (ind[i] > -1) {
            len = ind[i];
            break;
        }
    }

    fprintf(stdout, "strlen %d\n", len);
    free(str);
    return 0;
}
You could do something ugly like this on Windows, enclosing the unsafe memory reads in an SEH __try block:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 2

DWORD WINAPI FindZeroThread(LPVOID lpParameter)
{
    const char* volatile* pp = (const char* volatile*)lpParameter;
    __try
    {
        while (**pp)
        {
            (*pp) += N;
        }
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        *pp = NULL;
    }
    return 0;
}

size_t pstrlen(const char* s)
{
    int i;
    HANDLE handles[N];
    const char* volatile ptrs[N];
    const char* p = (const char*)(UINT_PTR)-1;

    for (i = 0; i < N; i++)
    {
        ptrs[i] = s + i;
        handles[i] = CreateThread(NULL, 0, &FindZeroThread, (LPVOID)&ptrs[i], 0, NULL);
    }

    WaitForMultipleObjects(N, handles, TRUE /* bWaitAll */, INFINITE);

    for (i = 0; i < N; i++)
    {
        CloseHandle(handles[i]);
        if (ptrs[i] && p > ptrs[i]) p = ptrs[i];
    }

    return (size_t)(p - s);
}

#define LEN (20 * 1000 * 1000)

int main(void)
{
    char* s = malloc(LEN);
    memset(s, '*', LEN);
    s[LEN - 1] = 0;
    printf("strlen()=%zu pstrlen()=%zu\n", strlen(s), pstrlen(s));
    return 0;
}
Output:
strlen()=19999999 pstrlen()=19999999
I think it may be better to use MMX/SSE instructions to speed up the code in a somewhat parallel way.
EDIT: This may not be such a good idea on Windows after all; see Raymond Chen's IsBadXxxPtr should really be called CrashProgramRandomly.
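For the SSE suggestion above, here is a sketch of a 16-bytes-at-a-time scan using SSE2 intrinsics (it relies on the common assumption that an aligned 16-byte load never crosses a page boundary, and on GCC/Clang's __builtin_ctz):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

size_t sse2_strlen(const char *s)
{
    const char *p = s;

    /* Advance byte by byte until p is 16-byte aligned. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Compare 16 bytes at a time against zero; the mask gets one bit
       per byte position that held a zero. */
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + __builtin_ctz(mask);
        p += 16;
    }
}

int main(void)
{
    printf("%zu\n", sse2_strlen("hello, world")); /* prints 12 */
    return 0;
}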
Let me acknowledge this up front: the following code is written in C#, not C, but the idea I am trying to articulate carries over directly. Most of the content comes from a Parallel Patterns draft document by Microsoft.
To do the best static partitioning possible, you need to be able to accurately predict ahead of time how long all the iterations will take. That’s rarely feasible, resulting in a need for a more dynamic partitioning, where the system can adapt to changing workloads quickly. We can address this by shifting to the other end of the partitioning tradeoffs spectrum, with as much load-balancing as possible.
To do that, rather than pushing to each of the threads a given set of indices to process, we can have the threads compete for iterations. We employ a pool of the remaining iterations to be processed, which initially starts filled with all iterations. Until all of the iterations have been processed, each thread goes to the iteration pool, removes an iteration value, processes it, and then repeats. In this manner, we can achieve in a greedy fashion an approximation for the optimal level of load-balancing possible (the true optimum could only be achieved with a priori knowledge of exactly how long each iteration would take). If a thread gets stuck processing a particular long iteration, the other threads will compensate by processing work from the pool in the meantime. Of course, even with this scheme you can still find yourself with a far from optimal partitioning (which could occur if one thread happened to get stuck with several pieces of work significantly larger than the rest), but without knowledge of how much processing time a given piece of work will require, there’s little more that can be done.
Here’s an example implementation that takes load-balancing to this extreme. The pool of iteration values is maintained as a single integer representing the next iteration available, and the threads involved in the processing “remove items” by atomically incrementing this integer:
public static void MyParallelFor(
    int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body)
{
    // Get the number of processors, initialize the number of remaining
    // threads, and set the starting point for the iteration.
    int numProcs = Environment.ProcessorCount;
    int remainingWorkItems = numProcs;
    int nextIteration = inclusiveLowerBound;

    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        // Create each of the work items.
        for (int p = 0; p < numProcs; p++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                int index;
                while ((index = Interlocked.Increment(
                        ref nextIteration) - 1) < exclusiveUpperBound)
                {
                    body(index);
                }
                if (Interlocked.Decrement(ref remainingWorkItems) == 0)
                    mre.Set();
            });
        }

        // Wait for all threads to complete.
        mre.WaitOne();
    }
}
}