Given a C string (an array of characters terminated by a null character), we have to find the length of the string. Could you please suggest some ways to parallelize this for N threads of execution? I am having trouble dividing it into sub-problems, since accessing a location of the array that is not present will give a segmentation fault.
EDIT: I am not concerned with whether doing this task in parallel has much greater overhead or not. I just want to know if it can be done (using something like OpenMP etc.).
No, it can't, because each step requires the previous state to be known (did we encounter a null on the previous char?). You can only safely check one character at a time.
Imagine you are turning over rocks and you MUST stop at one with white paint underneath (null) or you will die (aka seg fault etc).
You can't have people "working ahead" of each other, as the white paint rock might be in between.
Having multiple people (threads/processes) would simply be them taking turns being the one turning over the next rock. They would never be turning over rocks at the same time as each other.
It's probably not even worth trying. If the string is short, the overhead will be greater than the gain in processing speed. If the string is really long, the speed will probably be limited by memory bandwidth, not by CPU processing speed.
I'd say that with just a standard C string this cannot be done. However, if you can define a personal termination sequence with as many characters as there are threads, it's straightforward.
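For illustration, a minimal pthread sketch of that idea (my code, not the answerer's): the caller guarantees the string ends in at least NTHREADS consecutive '\0' bytes, so thread t can stride through offsets t, t + NTHREADS, t + 2*NTHREADS, ... and is guaranteed to stop on one of the terminator bytes; the length is the smallest stop position.

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

struct job { const char *s; size_t start, stop; };

static void *scan(void *arg)
{
    struct job *j = arg;
    size_t i = j->start;
    /* no interior '\0' exists, so the first zero on this stride is
       one of the NTHREADS terminator bytes: we can never run past */
    while (j->s[i] != '\0')
        i += NTHREADS;
    j->stop = i;
    return NULL;
}

/* s must end with at least NTHREADS consecutive '\0' bytes */
size_t mt_strlen(const char *s)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (struct job){ s, (size_t)t, 0 };
        pthread_create(&tid[t], NULL, scan, &jobs[t]);
    }
    size_t len = (size_t)-1;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        if (jobs[t].stop < len)
            len = jobs[t].stop; /* true length = smallest stop */
    }
    return len;
}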
Do you know the maximum size of that char array? If so, you can do a parallel search over different chunks and return the smallest index at which a terminator is found.
Since you are then only working on allocated memory, you cannot get segfaults.
Of course this is not as sophisticated as s_nair's answer, but it is pretty straightforward.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int N = 1000;
    char *str = calloc(N, sizeof(char));
    strcpy(str, "This is a test string!");
    fprintf(stdout, "%s\n", str);

    int nthreads = omp_get_num_procs();
    int i;
    int ind[nthreads];
    for (i = 0; i < nthreads; i++) {
        ind[i] = -1;
    }

    int procn;
    int flag;
    #pragma omp parallel private(procn, flag)
    {
        flag = 1;
        procn = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++) {
            if (str[i] == '\0' && flag == 1) {
                ind[procn] = i; /* first '\0' seen by this thread */
                flag = 0;
            }
        }
    }

    /* the length is the smallest index at which any thread saw '\0' */
    int len = -1;
    for (i = 0; i < nthreads; i++) {
        if (ind[i] > -1 && (len == -1 || ind[i] < len)) {
            len = ind[i];
        }
    }

    fprintf(stdout, "strlen %d\n", len);
    free(str);
    return 0;
}
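(A note on the example above: ind[] is sized with omp_get_num_procs(), which matches the default team size in most OpenMP implementations but is not guaranteed to; compile with gcc -fopenmp, or your compiler's equivalent, to enable the pragmas.)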
You could do something ugly like this on Windows, enclosing unsafe memory reads in an SEH __try block:
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 2

DWORD WINAPI FindZeroThread(LPVOID lpParameter)
{
    const char* volatile* pp = (const char* volatile*)lpParameter;
    __try
    {
        /* scan every N-th character until a '\0' is found */
        while (**pp)
        {
            (*pp) += N;
        }
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        /* ran off the end of accessible memory: no '\0' on this stride */
        *pp = NULL;
    }
    return 0;
}

size_t pstrlen(const char* s)
{
    int i;
    HANDLE handles[N];
    const char* volatile ptrs[N];
    const char* p = (const char*)(UINT_PTR)-1;

    for (i = 0; i < N; i++)
    {
        ptrs[i] = s + i; /* thread i starts at offset i */
        handles[i] = CreateThread(NULL, 0, &FindZeroThread, (LPVOID)&ptrs[i], 0, NULL);
    }

    WaitForMultipleObjects(N, handles, TRUE /* bWaitAll */, INFINITE);

    for (i = 0; i < N; i++)
    {
        CloseHandle(handles[i]);
        /* the terminator is at the smallest stop position */
        if (ptrs[i] && p > ptrs[i]) p = ptrs[i];
    }

    return (size_t)(p - s);
}

#define LEN (20 * 1000 * 1000)

int main(void)
{
    char* s = malloc(LEN);
    memset(s, '*', LEN);
    s[LEN - 1] = 0;
    printf("strlen()=%zu pstrlen()=%zu\n", strlen(s), pstrlen(s));
    return 0;
}
Output:
strlen()=19999999 pstrlen()=19999999
I think it may be better to use MMX/SSE instructions to speed up the code in a somewhat parallel way.
EDIT: This may not be a very good idea on Windows after all, see Raymond Chen's
IsBadXxxPtr should really be called CrashProgramRandomly.
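Following up on the MMX/SSE idea, here is a rough single-threaded sketch using SSE2 intrinsics (my illustration, not part of the answer above). It assumes GCC/Clang for __builtin_ctz, and it relies on the fact that an aligned 16-byte load never crosses a page boundary, so reading slightly past the terminator cannot fault:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

size_t sse2_strlen(const char *s)
{
    const char *p = s;

    /* check one byte at a time until p is 16-byte aligned */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* compare 16 bytes against zero in one step */
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0) /* some byte was '\0' */
            return (size_t)(p - s) + __builtin_ctz(mask);
        p += 16;
    }
}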
Let me acknowledge this: the following code is written in C#, not C, but you can carry over the idea I am trying to articulate. Most of the content is from a draft Microsoft document on parallel patterns.
To do the best static partitioning possible, you need to be able to accurately predict ahead of time how long all the iterations will take. That’s rarely feasible, resulting in a need for a more dynamic partitioning, where the system can adapt to changing workloads quickly. We can address this by shifting to the other end of the partitioning tradeoffs spectrum, with as much load-balancing as possible.
To do that, rather than pushing to each of the threads a given set of indices to process, we can have the threads compete for iterations. We employ a pool of the remaining iterations to be processed, which initially starts filled with all iterations. Until all of the iterations have been processed, each thread goes to the iteration pool, removes an iteration value, processes it, and then repeats. In this manner, we can achieve in a greedy fashion an approximation for the optimal level of load-balancing possible (the true optimum could only be achieved with a priori knowledge of exactly how long each iteration would take). If a thread gets stuck processing a particular long iteration, the other threads will compensate by processing work from the pool in the meantime. Of course, even with this scheme you can still find yourself with a far from optimal partitioning (which could occur if one thread happened to get stuck with several pieces of work significantly larger than the rest), but without knowledge of how much processing time a given piece of work will require, there’s little more that can be done.
Here’s an example implementation that takes load-balancing to this extreme. The pool of iteration values is maintained as a single integer representing the next iteration available, and the threads involved in the processing “remove items” by atomically incrementing this integer:
public static void MyParallelFor(
    int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body)
{
    // Get the number of processors, initialize the number of remaining
    // threads, and set the starting point for the iteration.
    int numProcs = Environment.ProcessorCount;
    int remainingWorkItems = numProcs;
    int nextIteration = inclusiveLowerBound;

    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        // Create each of the work items.
        for (int p = 0; p < numProcs; p++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                int index;
                while ((index = Interlocked.Increment(
                           ref nextIteration) - 1) < exclusiveUpperBound)
                {
                    body(index);
                }
                if (Interlocked.Decrement(ref remainingWorkItems) == 0)
                    mre.Set();
            });
        }
        // Wait for all threads to complete.
        mre.WaitOne();
    }
}
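If you want the same shape in C, a sketch with C11 atomics and pthreads might look like this (my_parallel_for, body_fn and struct pool are names I made up, not from the original document); each atomic_fetch_add plays the role of Interlocked.Increment:

#include <pthread.h>
#include <stdatomic.h>

typedef void (*body_fn)(int index, void *arg);

struct pool {
    atomic_int next; /* next iteration available ("the pool") */
    int upper;       /* exclusive upper bound */
    body_fn body;
    void *arg;
};

static void *worker(void *p)
{
    struct pool *pool = p;
    int i;
    /* atomically "remove" one iteration at a time until none remain */
    while ((i = atomic_fetch_add(&pool->next, 1)) < pool->upper)
        pool->body(i, pool->arg);
    return NULL;
}

void my_parallel_for(int lo, int hi, int nthreads, body_fn body, void *arg)
{
    pthread_t tid[nthreads];
    struct pool pool;
    atomic_init(&pool.next, lo);
    pool.upper = hi;
    pool.body = body;
    pool.arg = arg;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, worker, &pool);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}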
Related
I am measuring the latency of some operations.
There are many scenarios here.
The latency of each scenario is roughly distributed within a small interval. For each scenario, I need to measure 500,000 times. Finally, I want to output each latency value and the number of times it occurred.
My initial implementation was:
#define range 1000
int rec_array[range];

for (int i = 0; i < 500000; i++) {
    int latency = measure_latency();
    rec_array[latency]++;
}

for (int i = 0; i < range; i++) {
    printf("%d %d\n", i, rec_array[i]);
}
This approach was fine at first, but as the number of scenarios grew, it became problematic.
The latency measured in each scenario is concentrated in a small interval, so most of the entries in rec_array are 0.
Since each scenario is different, the latency values differ too. Some latencies are concentrated around 500, so I need an array longer than 500; others are concentrated around 5000, so I need an array longer than 5000.
Due to the large number of scenarios, I end up creating too many arrays. For example, with ten scenarios I need to create ten rec_arrays, each with a different length.
Is there an efficient and convenient strategy? Since I am using C, templates like vector cannot be used.
I considered linked lists. However, the interval of the latency distribution is uncertain, the number of occurrences of any particular latency is uncertain, and when the same latency occurs again the count needs to be incremented, so that doesn't seem very convenient either.
I'm sorry, I had just gone out. Thank you for your help. I read the comments carefully. Here are some of my answers.
This data is mainly used to draw plots, for example the one below.
The comment area says the data seems small. The main reason I thought about this problem is that, according to the plot, only a few entries are used each time and the vast majority are 0, and there are many scenarios, each of which needs its own array. I have referred to an open-source implementation.
According to the comments, it seems that using arrays directly is a good solution, considering fast access. Thanks very much!
A linked list is probably (and almost always) the least efficient way to store things – both slow as hell, and memory inefficient, since your values use less storage than your pointers. Linked lists are very rarely a good solution for anything that actually stores significant data. The only reason they're so prevalent is that C still has no proper containers, and they're easy wheels to
reinvent for every single C program you write.
#define range 1000
int rec_array[range];
So you're (probably! This depends on your compiler and where you write int rec_array[range];) storing rec_array on the stack, and it's large. (Actually, 4000 bytes is not "large" by any modern computer's standards, but still.) You should not be doing that; instead, this should be heap-allocated, once, at initialization.
The solution is to allocate it:
/* SPDX-License-Identifier: LGPL-2.1+ */
/* Copyright Marcus Müller and others */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_RUNS 500000

/*
 * Call as
 * program maximum_latency
 */
unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
                            unsigned int *max_latency) {
  for (unsigned int run = 0; run < N_RUNS; ++run) {
    unsigned int latency = measure_latency();
    if (latency >= *max_latency) {
      /* clamp out-of-range values into the last counter */
      latency = *max_latency - 1;
      /*
       * alternatively: use realloc to increase the size of `latencies`,
       * and update max_latency as well; that's basically what C++
       * std::vector does
       */
    }
    (latencies[latency])++;
  }
  return latencies;
}
void print_benchmark_result(unsigned int *array, unsigned int length);

int main(int argc, char **argv) {
  // check argument
  if (argc != 2) {
    exit(127);
  }
  int maximum_latency_raw = atoi(argv[1]);
  if (maximum_latency_raw <= 0) {
    exit(126);
  }
  unsigned int maximum_latency = maximum_latency_raw;
  /*
   * note that the length no longer has to be a constant
   * if you're using calloc/malloc.
   */
  unsigned int *latency_counters =
      (unsigned int *)calloc(maximum_latency, sizeof(unsigned int));
  for (; /* benchmark task in benchmark_tasks */;) {
    run_benchmark(task, latency_counters, &maximum_latency);
    print_benchmark_result(latency_counters, maximum_latency);
    // clear our counters after each run!
    memset(latency_counters, 0, maximum_latency * sizeof(unsigned int));
  }
}
void print_benchmark_result(unsigned int *array, unsigned int length) {
  for (unsigned int index = 0; index < length; ++index) {
    printf("%u %u\n", index, array[index]);
  }
  puts("============================\n");
}
Note especially the "alternatively: realloc" comment in the middle: realloc allows you to increase the size of your array:
unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
                            unsigned int *max_latency) {
  for (unsigned int run = 0; run < N_RUNS; ++run) {
    unsigned int latency = measure_latency();
    while (latency >= *max_latency) {
      // double the size (repeatedly, in case latency is far out of range)!
      latencies = (unsigned int *)realloc(
          latencies, (*max_latency) * 2 * sizeof(unsigned int));
      // realloc doesn't zero out the extension, so we need to do that
      // ourselves.
      memset(latencies + (*max_latency), 0,
             (*max_latency) * sizeof(unsigned int));
      (*max_latency) *= 2;
    }
    (latencies[latency])++;
  }
  return latencies;
}
This way, your array grows when you need it to!
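One caveat not covered above: assigning realloc's return value straight back to latencies loses the old pointer if realloc fails and returns NULL. In code you care about, assign to a temporary variable, check it against NULL, and only then overwrite latencies.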
How about using a hash table, so we only store the latencies that actually occurred? Maybe the keys in the hash table can be ranges, while the values of those keys are the occurrence counts.
Just sacrifice some precision in your latencies like 0-15, 16-31, 32-47 ... etc. Now your array will be 16x smaller.
Allocate all latency counter arrays for all scenes in one go
unsigned int *latency_div16_counter = (unsigned int *)calloc((MAX_LATENCY >> 4) * NUM_OF_SCENES, sizeof(unsigned int));
Clamp the values to the max latency, div 16 and store
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
    for (int i = 0; i < 500000; i++) {
        int latency = measure_latency();
        if (latency >= MAX_LATENCY) latency = MAX_LATENCY - 1;
        latency = latency >> 4; // int div 16
        latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + latency]++;
    }
}
Adjust the data (mul 16) before displaying it
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
    for (int i = 0; i < (MAX_LATENCY >> 4); i++) {
        printf("Scene %d Latency %d Total %d\n", scene, i * 16,
               latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + i]);
    }
}
Just wondering if these variations of for loops are more efficient and practical.
By messing with the C for-loop syntax I can embed statements that would go in the loop body into the loop head, like so:
Example 1:
#include <stdio.h>

int main(int argc, char **argv)
{
    // Simple program that prints out the command line arguments passed in
    if (argc > 1)
    {
        for (int i = 1; puts(argv[i++]), i < argc;);
        // This does the same as this:
        // for (int i = 1; i < argc; i++)
        // {
        //     puts(argv[i]);
        // }
    }
    return 0;
}
I understand how the commas work in the for loop: it goes through each expression in order, evaluates them, then disregards all but the last one, which is why it is able to iterate using the "i < argc" condition. There is no need for the final segment to increment the i variable, as I did that in the middle segment of the loop head (in the puts(argv[i++]) bit).
Is this more efficient, or is it just cleaner to separate it into the loop body rather than combine it all into one line?
Example 2:
int stringLength(const char * string)
{
    // Function that counts characters up until the null terminator
    // character and returns the total
    int counter = 0;
    for (counter; string[counter] != '\0'; counter++);
    return counter;
    // Same as:
    // int counter = 0;
    // for (int i = 0; string[i] != '\0'; i++)
    // {
    //     counter++;
    // }
    // return counter;
}
This one seems more efficient than the version with the loop body, as no local variable for the for loop is initialised. Is it conventional to write these sorts of loops with no bodies?
Step 1: Correctness
Make sure code is correct.
Consider OP's code below. Does it attempt to print argv[argc] which would be bad?
if (argc > 1) {
for(int i = 1; puts(argv[i++]), i < argc;);
I initially thought it did. So did another user. Yet it is OK.
… and this is exactly why the code is weak.
Code not only should be correct; better code looks correct too. Using an anti-pattern as suggested by OP is rarely1 a good thing.
Step 2: Since the code variations have the same big-O, focus on understandability.
Sculpt your code – remove what is not needed.
for (int i = 1; i < argc; i++) {
puts(argv[i]);
}
What OP is doing is a trivial optimization concern.
Is premature optimization really the root of all evil?
Is it conventional to do these sorts of loops with no bodies?
Not really.
The key to coding style is to follow your group's style guide. Great software is often a team effort. If your group likes to minimize loop bodies, go ahead. I have seen the opposite more commonly: explicit { some_code } bodies.
Note: int stringLength(const char * string) fails for strings longer than INT_MAX. Better to use size_t as the return type – thus an example of step 1 faltering.
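A size_t version of OP's function, as a sketch of that fix:

#include <stddef.h>

size_t stringLength(const char *string)
{
    size_t counter = 0;
    while (string[counter] != '\0')
        counter++;
    return counter;
}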
1 All coding style rules, except this rule, have exceptions.
I am using this program written in C to determine the permutations of size 10 of a regular alphabet.
When I run the program it only uses 36% of my 3GHz CPU leaving 50% free. It also only uses 7MB of my 8GB of RAM.
I would like to use at least 70-80% of my computer's performance, not just this misery. This limitation is making the procedure very time consuming, and I don't know how many days it will take to produce the complete output. I need help resolving this issue in the shortest possible time, whether by improving the source code or by other means.
Any help is welcome, even if the solution means using another language instead of C, if that gives better performance when executing the program.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static int count = 0;

void print_permutations(char arr[], char prefix[], int n, int k) {
    int i, j, l = strlen(prefix);
    char newprefix[l + 2];

    if (k == 0) {
        printf("%d %s\n", ++count, prefix);
        return;
    }
    for (i = 0; i < n; i++) {
        //Concatenation of currentPrefix + arr[i] = newPrefix
        for (j = 0; j < l; j++)
            newprefix[j] = prefix[j];
        newprefix[l] = arr[i];
        newprefix[l + 1] = '\0';
        print_permutations(arr, newprefix, n, k - 1);
    }
}

int main() {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    print_permutations(arr, "", n, k);
    system("pause");
    return 0;
}
There are fundamental problems with your approach:
What are you trying to achieve?
If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed as it enumerates all combinations of 10 letters from the alphabet. Your program will produce 26^10 combinations, a huge number: 141167095653376, that is 141,167 billion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 petabytes, unlikely to fit on your storage space. Writing this at the top speed of 100MB/s would take more than 20 days.
The number of permutations, that is combinations of distinct letters from the 26 letter alphabet, is not quite as large: 26! / 16!, which is still large: 19275223968000, 7 times less than the previous result. That is still more than 212 terabytes of storage and more than 24 days at 100MB/s.
Storing these permutations is therefore impractical. You could change your program to just count the permutations and measure how long it takes if the count is the expected value. The first step of course is to correct your program to produce the correct set.
Test on smaller sets to verify correctness
Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.
Once you have correctness, only then focus on performance
Selecting different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent producing the output. Once you stop producing output, you may find that strlen() consumes a significant fraction of the execution time, which is wasteful since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, removing the need to copy it at each recursive step.
Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.
Memory use is no measure of performance
Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. 8MB is actually much more than your program's actual needs: this count includes the full stack space assigned to the program, of which only a tiny fraction will be used.
As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.
Here is a modified program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long count;

void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
    if (pos == k) {
        prefix[k] = '\0';
        ++count;
        //printf("%llu %s\n", count, prefix);
        //puts(prefix);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            prefix[pos] = arr[i];
            print_permutations(arr, n, used, prefix, pos + 1, k);
            used[i] = 0;
        }
    }
}

int main(int argc, char *argv[]) {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    char used[27] = { 0 };
    char perm[27];
    unsigned long long expected_count;
    clock_t start, elapsed;

    if (argc >= 2)
        k = strtol(argv[1], NULL, 0);
    if (argc >= 3)
        n = strtol(argv[2], NULL, 0);

    start = clock();
    print_permutations(arr, n, used, perm, 0, k);
    elapsed = clock() - start;

    expected_count = 1;
    for (int i = n; i > n - k; i--)
        expected_count *= i;

    printf("%llu permutations, expected %llu, %.0f permutations per second\n",
           count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
    return 0;
}
Without output, this program enumerates 140 million combinations per second on my slow laptop, it would take 1.5 days to enumerate the 19275223968000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual core hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global variable count.
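To illustrate that last step, here is a sketch of such a multi-threaded count, assuming OpenMP (compile with -fopenmp): each thread enumerates the subtrees hanging off different first letters, and private counts are combined with a reduction, so the global count variable disappears. count_from is a counting variant of the recursive function above, introduced for this sketch:

#include <stdio.h>

/* counts the k-permutations that complete the current prefix */
static unsigned long long count_from(int n, char used[], int pos, int k)
{
    if (pos == k)
        return 1;
    unsigned long long c = 0;
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            c += count_from(n, used, pos + 1, k);
            used[i] = 0;
        }
    }
    return c;
}

int main(void)
{
    int n = 26, k = 10;
    unsigned long long total = 0;

    /* one subtree per first letter; private counts merged by reduction */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (int i = 0; i < n; i++) {
        char used[26] = { 0 };
        used[i] = 1;
        total += count_from(n, used, 1, k);
    }
    printf("%llu permutations\n", total);
    return 0;
}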
There are multiple reasons for your bad experience:
Your metric:
Your metric is fundamentally flawed. Peak CPU% is an imprecise measurement of "how much work does my CPU do", which normally isn't what you're most interested in anyway. You can inflate this number by doing more work (like starting another thread that doesn't contribute to the output at all).
Your proper metric would be items per second: How many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4), and measure how long it takes.
Your problem: Your problem is hard. Printing or writing down all 26^10 ≈ 1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be 1.4 petabytes - which is most likely more than your hard drive will accept. Also, if you used your CPU at 100% and spent one thousand cycles per word, it'd take 1.5 years. 1000 cycles per word is optimistic: you most likely won't be faster than that while still printing your result, as printf alone usually takes around 1000 cycles to complete.
Your output: Writing to stdout is slow compared to writing to a file, see https://stackoverflow.com/a/14574238/4838547.
Your program: There are issues with your program that could be a problem for your performance. However, they are dominated by the other problems stated here. With my setup, this program uses 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
I currently have a c program that generates all possible combinations of a character string. Please note combinations, not permutations. This is the program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

//Constants
static const char set[] = "abcd";
static const int setSize = sizeof(set) - 1;

void brute(char* temp, int index, int max){
    //Declarations
    int i;

    for(i = 0; i < setSize; i++){
        temp[index] = set[i];
        if(index == max - 1){
            printf("%s\n", temp);
        }
        else{
            brute(temp, index + 1, max);
        }
    }
}

void length(int max_len){
    //Declarations
    char* temp = (char *) malloc(max_len + 1);
    int i;

    //Execute
    for(i = 1; i <= max_len; i++){
        memset(temp, 0, max_len + 1);
        brute(temp, 0, i);
    }
    free(temp);
}

int main(void){
    //Execute
    length(2);
    getchar();
    return 0;
}
The maximum length of the combinations can be modified; it is currently set to 2 for demonstration purposes. Given how it's currently configured, the program outputs
a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, cb, cc...
I've managed to translate this program into cuda:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
//On-Device Memory
__constant__ char set_d[] = "adcd";
__constant__ int setSize_d = 4;
__device__ void brute(char* temp, int index, int max){
//Declarations
int i;
for(i = 0; i < setSize_d; i++){
temp[index] = set_d[i];
if(index == max - 1){
printf("%s\n", temp);
}
else{
brute(temp, index + 1, max);
}
}
}
__global__ void length_d(int max_len){
//Declarations
char* temp = (char *) malloc(max_len + 1);
int i;
//Execute
for(i = 1; i <= max_len; i++){
memset(temp, 0, max_len+1);
brute(temp, 0, i);
}
free(temp);
}
int main()
{
//Execute
cudaSetDevice(0);
//Launch Kernel
length_d<<<1, 1>>>(2);
cudaDeviceSynchronize();
getchar(); //Keep this console open...
return 0;
}
The cuda version of the original program is basically an exact copy of the c program (note that it is being compiled with -arch=sm_20. Therefore, printf and other host functions work in the cuda environment).
My goal is to compute combinations of a-z, A-Z, 0-9, and other characters of maximum lengths up to 10. That being the case, I want this program to run on my gpu. As it is now, it does not take advantage of parallel processing - which obviously defeats the whole purpose of writing the program in cuda. However, I'm not sure how to remove the recursive nature of the program in addition to delegating the threads to a specific index or starting point.
Any constructive input is appreciated.
Also, I get an occasional warning message on successive compiles (meaning it sporadically appears): warning : Stack size for entry function '_Z8length_di' cannot be statically determined.
I haven't pinpointed the problem yet, but I figured I would post it in case anyone identified the cause before I can. It is being compiled in visual studio 2012.
Note: I found this to be fairly interesting. As the cuda program is now, its output to the console is periodic - meaning that it prints a few dozen combinations, pauses, prints a few dozen combinations, pauses, and so forth. I also observe this behavior in its reported gpu usage - it periodically swings from 5% to 100%.
I don't think you need to use recursion for this. (I wouldn't).
Using printf from the kernel is problematic for large amounts of output; it's not really designed for that purpose. And printf from the kernel eliminates any speed benefit the GPU might have. And I assume if you're testing a large vector space like this, your goal is not to print out every combination. Even if that were your goal, printf from the kernel is not the way to go.
Another issue you will run into is storage for the entire vector space you have in mind. If you have some processing you intend to do on each vector and then you can discard it, then storage is not an issue. But storage for a vector space of length n=10 with "digits" (elements) that have k=62 or more possible values (a..z, A..Z, 0..9, etc.) will be huge. It's given by k^n, so in this example that would be 62^10 different vectors. If each digit required a byte to store it, that would be over 8 billion gigabytes. So this pretty much dictates that storage of the entire vector space is out of the question. Whatever work you're going to do, you're going to have to do it on the fly.
Given the above discussion, this answer should have pretty much everything that you need. The vector digits are handled as unsigned int, you can create whatever mapping you want between unsigned int and your "digits" (i.e. a..z, A..Z, 0..9, etc.) In that example, the function that was performed on each vector was testing if the sum of the digits matched a certain value, so you could replace this line in the kernel:
if (vec_sum(vec, n) == sum) atomicAdd(count, 1UL);
with whatever function and processing you wanted to apply to each vector generated. You could even put a printf here, but for larger spaces the output will be fragmented and incomplete.
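To make the "no recursion needed" point concrete, here is a plain C sketch (my illustration, not the linked answer's code) of the index-to-vector mapping for a fixed length: each combination is just a number written in base setSize, so a linear index can be decoded into characters with div/mod. In a kernel, each thread would decode and process one index; to reproduce OP's output for all lengths, loop over lengths as OP's length() does.

#include <stdio.h>

static const char set[] = "abcd";
#define SET_SIZE (sizeof(set) - 1)

/* decode linear index -> combination: the index is read as a number
   in base SET_SIZE, most significant digit first, matching the order
   of the recursive version for a fixed length */
static void print_combination(unsigned long long idx, int len)
{
    char buf[16];
    for (int d = len - 1; d >= 0; d--) {
        buf[d] = set[idx % SET_SIZE];
        idx /= SET_SIZE;
    }
    buf[len] = '\0';
    puts(buf);
}

int main(void)
{
    int len = 2;
    unsigned long long total = 1;
    for (int i = 0; i < len; i++)
        total *= SET_SIZE;         /* SET_SIZE^len combinations */
    for (unsigned long long i = 0; i < total; i++)
        print_combination(i, len); /* a kernel would do one i per thread */
    return 0;
}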
A colleague of mine asked me to write a homework assignment for him. Although this wasn't too ethical, I did it; I plead guilty.
This is how the problem goes:
Write a program in C where the sum 1² + 2² + ... + n² is calculated.
Assume that n is a multiple of p, where p is the number of threads.
This is what I wrote:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SQR(X) ((X) * (X))

int n, p = 10, total_sum = 0;
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Function prototype */
void *do_calc(void *arg);

int main(int argc, char** argv)
{
    int i;
    pthread_t *thread_array;

    printf("Type number n: ");
    fscanf(stdin, "%d", &n);

    if (n % p != 0) {
        fprintf(stderr, "Number must be multiple of 10 (number of threads)\n");
        exit(-1);
    }

    thread_array = (pthread_t *) malloc(p * sizeof(pthread_t));

    for (i = 0; i < p; i++)
        pthread_create(&thread_array[i], NULL, do_calc, (void *) i);

    for (i = 0; i < p; i++)
        pthread_join(thread_array[i], NULL);

    printf("Total sum: %d\n", total_sum);
    pthread_exit(NULL);
}

void *do_calc(void *arg)
{
    int i, local_sum = 0;
    int thr = (int) arg;

    pthread_mutex_lock(&mtx);
    for (i = thr * (n / p); i < ((thr + 1) * (n / p)); i++)
        local_sum += SQR(i + 1);

    total_sum += local_sum;
    pthread_mutex_unlock(&mtx);

    pthread_exit(NULL);
}
Aside from the logical/syntactic point of view, I was wondering:
how the respective non-multithreaded program would perform
how could I test/see their performance
what would be the program without using threads
Thanks in advance and I’m looking forward to reading your thoughts
You are acquiring the mutex before the calculations. You should acquire it only immediately before adding your local sum to the total:
pthread_mutex_lock(&mtx);
total_sum += local_sum;
pthread_mutex_unlock(&mtx);
This would depend on how many CPUs you have. With a single CPU core, a computation-bound program will never run faster with multiple threads.
Moreover, since you're doing all the work with the lock held, you'll end up with only a single thread running at any time, so it's effectively single threaded anyway.
Don't bother with threading etc. In fact, don't do any additions in a loop at all. Just use this formula:
∑_{r=1}^{n} r² = n(n + 1)(2n + 1) / 6 [1]
[1] http://thesaurus.maths.org/mmkb/entry.html?action=entryById&id=1539
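In C that closed form is a single expression; a sketch (unsigned long long merely postpones overflow, and the integer division is exact because n(n+1)(2n+1) is always divisible by 6):

unsigned long long sum_of_squares(unsigned long long n)
{
    /* exact: n(n+1)(2n+1) is always divisible by 6 */
    return n * (n + 1) * (2 * n + 1) / 6;
}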
As your code is serialised by a mutex in the actual calculation, it will be slower than a non-threaded version. Of course, you could easily have tested this for yourself.
I would try to see how long those calculations take. If they are a very small fraction of the total time, then I would probably go for a single-threaded model, since spawning a thread for each calculation involves some overhead by itself.
To compare performance, just remember the system time at program start, run it with n=1000, and check the system time at the end. Compare to the non-threaded program's result.
As bdonlan said, the non-threaded version will run faster.
1) Single threaded would probably perform a bit better than this, because all calculations are done within a lock and the overhead of locking will add to the total time. You are better off only locking when adding the local sums to the total sum, or storing the local sums in an array and calculating the total sum in the main thread.
2) Use timing statements in your code to measure elapsed time during the algorithm (see the timing sketch after point 3). In the multithreaded case, only measure elapsed time on the main thread.
3) Derived from your code:
int i, total_sum = 0;
for (i = 0; i < n; i++)
    total_sum += SQR(i + 1);
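A sketch for point 2, assuming POSIX clock_gettime with CLOCK_MONOTONIC; compute() here stands in for whichever version is being measured, shown with the single-threaded one from point 3:

#include <stdio.h>
#include <time.h>

#define SQR(X) ((X) * (X))

/* the single-threaded computation from point 3 */
static long long compute(int n)
{
    long long total = 0;
    for (int i = 0; i < n; i++)
        total += SQR((long long)(i + 1));
    return total;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long long sum = compute(100000000);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%lld elapsed=%.6f s\n", sum, seconds);
    return 0;
}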
A much larger consideration comes to scheduling. The easiest way for kernel-side threading to be implemented is for each thread to get equal time regardless. Processes are just threads with their own memory space. IF all threads get equal time, adding a thread takes you from 1/n of the time to 2/(n + 1) of the time, which is obviously better given > 0 other threads that aren't you.
Actual implementations may and do vary wildly though.
Off-topic a bit, but maybe avoid the mutex by having each thread write its result into an array element (so assign "results = calloc(p, sizeof(int))" (btw "p" is an awful name for the variable holding the number of threads) and results[thr] = local_sum), and have the joining thread (well, main()) do the summing of the results. So each thread is responsible for just calculating its total: only main(), which orchestrates the threads, joins their data together. Separation of concerns.
For extra credit (:p), use the arg passed to do_calc() as a way to pass the thread ID and the location to write the result to rather than relying on a global array.
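Putting both suggestions together, a sketch (my code, not part of the original answers) where the arg carries the thread ID and a result slot, so no mutex or global array is needed:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct work {
    int id;        /* thread index */
    int n, p;      /* problem size, thread count */
    long long out; /* this thread's partial sum */
};

static void *do_calc(void *arg)
{
    struct work *w = arg;
    long long local = 0;
    for (int i = w->id * (w->n / w->p); i < (w->id + 1) * (w->n / w->p); i++)
        local += (long long)(i + 1) * (i + 1);
    w->out = local; /* no mutex needed: each thread owns its slot */
    return NULL;
}

int main(void)
{
    int n = 1000, p = 10;
    pthread_t tid[10];
    struct work *w = calloc(p, sizeof *w);
    for (int t = 0; t < p; t++) {
        w[t] = (struct work){ .id = t, .n = n, .p = p };
        pthread_create(&tid[t], NULL, do_calc, &w[t]);
    }
    long long total = 0;
    for (int t = 0; t < p; t++) {
        pthread_join(tid[t], NULL);
        total += w[t].out; /* main() joins the data together */
    }
    printf("Total sum: %lld\n", total);
    free(w);
    return 0;
}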