I need to solve the knapsack problem recursively, memoized and with dynamic programming. Currently I'm stuck at the dynamic programming method.
I adapted the code from what I found elsewhere on the internet. Currently the output is not correct.
The problem involves profit and mass. Each item has a profit and a mass; there are at most MAX_N items available, and the knapsack can hold at most MAX_CAPACITY mass. The aim is to have as much "profit" in the knapsack as possible.
Here is an example provided by the exercise:
Example: Given a knapsack of capacity 5, and items with mass[] = {2, 4, 3, 2}
and profit[] = {45, 40, 25, 15}, the best combination would be item 0 (with mass 2 and profit 45) and item 2 (with mass 3 and profit 25) for a total profit of 70. No other combination with mass 5 or less has a greater profit.
Here is the complete code:
#include <stdio.h>
#include <stdlib.h>
#define MAX_N 10
#define MAX_CAPACITY 165
int m[MAX_N][MAX_CAPACITY];
int max(int x, int y) {
return x ^ ((x ^ y) & -(x < y));
}
int min(int x, int y) {
return y ^ ((x ^ y) & -(x < y));
}
int knapsackRecursive(int capacity, int mass[], int profit[], int n) {
if (n < 0)
return 0;
if (mass[n] > capacity)
return knapsackRecursive(capacity, mass, profit, n-1);
else
return max(knapsackRecursive(capacity, mass, profit, n-1), knapsackRecursive(capacity - mass[n], mass, profit, n-1) + profit[n]);
}
int knapsackMemoized(int capacity, int mass[], int profit[], int n) {
}
int knapsackDynamic(int capacity, int mass[], int profit[], int n) {
int i;
int j;
for (i = 0; i <= n; i++) {
for (j = 0; j <= capacity; j++) {
if (i == 0 || j == 0)
m[i][j] = 0;
else if (mass[i-1] <= j)
m[i][j] = max(profit[i-1] + m[i-1][j-mass[i-1]], m[i-1][j]);
else
m[i][j] = m[i-1][j];
}
}
return m[n][capacity];
}
void test() {
// test values
//int M1[MAX_N] = {2, 4, 3, 2};
//int P1[MAX_N] = {45, 40, 25, 10};
int M1[MAX_N] = {6, 3, 2, 4};
int P1[MAX_N] = {50, 60, 40, 20};
int M2[MAX_N] = {23, 31, 29, 44, 53, 38, 63, 85, 89, 82};
int P2[MAX_N] = {92, 57, 49, 68, 60, 43, 67, 84, 87, 72};
// a)
printf("Recursion: %d\n",knapsackRecursive(MAX_CAPACITY, M1, P1, MAX_N));
printf("Recursion: %d\n",knapsackRecursive(MAX_CAPACITY, M2, P2, MAX_N));
printf("\n");
// b)
printf("Memoization: %d\n",knapsackMemoized(MAX_CAPACITY, M1, P1, MAX_N));
printf("Memoization: %d\n",knapsackMemoized(MAX_CAPACITY, M2, P2, MAX_N));
printf("\n");
// c)
printf("Dynamic Programming: %d\n",knapsackDynamic(MAX_CAPACITY, M1, P1, MAX_N));
printf("Dynamic Programming: %d\n",knapsackDynamic(MAX_CAPACITY, M2, P2, MAX_N));
}
int main() {
test();
}
This is the output I currently get. The recursive method should be supplying the correct result, but the dynamic programming one currently doesn't output the same. Memoization is not done yet, hence it doesn't output correctly either.
Recursion: 170
Recursion: 309
Memoization: 2686680
Memoization: 2686600
Dynamic Programming: 0
Dynamic Programming: 270
Process returned 25 (0x19) execution time : 0.269 s
Press any key to continue.
It turns out that the code I adapted for the dynamic programming part was supposed to work with int m[MAX_N+1][MAX_CAPACITY+1]; instead of int m[MAX_N][MAX_CAPACITY];, because its loops run up to and including i == n and j == capacity.
Changing that has given me working code, if not really the code I wanted.
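For reference, here is a minimal sketch of the table-based version with the enlarged array. The names table, max_int and knapsack_table are my own for illustration, and the sketch assumes n <= MAX_N and capacity <= MAX_CAPACITY:

```c
#define MAX_N 10
#define MAX_CAPACITY 165

/* One extra row and column: row 0 means "no items", column 0 means "no capacity". */
static int table[MAX_N + 1][MAX_CAPACITY + 1];

static int max_int(int x, int y) { return x > y ? x : y; }

/* Bottom-up 0/1 knapsack over the first n items.
   table[i][j] is the best profit using only items 0..i-1 with capacity j. */
int knapsack_table(int capacity, const int mass[], const int profit[], int n) {
    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= capacity; j++) {
            if (i == 0 || j == 0)
                table[i][j] = 0;
            else if (mass[i - 1] <= j)
                table[i][j] = max_int(profit[i - 1] + table[i - 1][j - mass[i - 1]],
                                      table[i - 1][j]);
            else
                table[i][j] = table[i - 1][j];
        }
    }
    return table[n][capacity];
}
```

Called with the example from the exercise (capacity 5, mass {2, 4, 3, 2}, profit {45, 40, 25, 15}, n = 4), it returns 70. Note that n here is the number of real items, not MAX_N.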
I have to generate the set Z of the first 100 integers that satisfy the equation i = 2^a * 3^b, with a and b being non-negative integers.
That is, Z = {1, 2, 3, 4, 6, 8, 9, 12, ...}
What algorithm could I use? I'll need to implement it in C.
In C
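#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>
typedef unsigned long long int ull;
/* qsort comparator: must return int, so compare instead of subtracting */
int cmp(const void * a, const void * b) {
ull x = *(const ull *)a, y = *(const ull *)b;
return (x > y) - (x < y);
}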
int main() {
int i = 0, a, b;
int A = 17,
B = 16;
int MAX = A * B;
ull z[MAX];
for (b = 0; b < B; ++b) {
for (a = 0; a < A; ++a) {
if (i >= MAX) break;
z[i++] = pow(2, a) * pow(3, b);
}
}
qsort(z, MAX, sizeof(ull), cmp);
printf("{ ");
for (i = 0; i < 100; ++i)
printf("%llu%s", z[i], i < 99 ? ", " : " ");
printf("}");
return 0;
}
Output
{ 1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24, 27, 32, 36, 48, 54, 64, 72, 81, 96, 108, 128, 144, 162, 192, 216, 243, 256, 288, 324, 384, 432, 486, 512, 576, 648, 729, 768, 864, 972, 1024, 1152, 1296, 1458, 1536, 1728, 1944, 2048, 2187, 2304, 2592, 2916, 3072, 3456, 3888, 4096, 4374, 4608, 5184, 5832, 6144, 6561, 6912, 7776, 8192, 8748, 9216, 10368, 11664, 12288, 13122, 13824, 15552, 16384, 17496, 18432, 19683, 20736, 23328, 24576, 26244, 27648, 31104, 32768, 34992, 36864, 39366, 41472, 46656, 49152, 52488, 55296, 59049, 62208, 65536, 69984, 73728, 78732, 82944, 93312 }
EDIT: Gives correct output now without overflow (see http://ideone.com/Rpbqms)
That's too much brute force.
Let me propose an O(n * lg n) time, O(n) space algorithm to achieve this.
I am not going to provide any real code, only a piece of self-invented pseudocode.
The idea is to use a min-heap to maintain the ordering:
func first-n-of-that(limit)
    heap = min-heap()
    heap.insert 1
    results = []
    while results.length < limit
        to-add = heap.pop
        if results is empty or results.last != to-add   // 6 arrives as both 2*3 and 3*2
            results.add to-add
            heap.insert 2 * to-add
            heap.insert 3 * to-add
    return results
The correctness is provable by induction.
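Since the question asks for C, here is one possible translation of the heap idea. It is only a sketch: heap_push, heap_pop, first_n_2a3b and MAX_RESULTS are names I made up, the heap is a plain array-based binary min-heap, and a popped value is skipped when it equals the previous result, because a number such as 6 can be inserted twice (as 2*3 and as 3*2).

```c
#include <stddef.h>

typedef unsigned long long ull;

#define MAX_RESULTS 100

/* Append value and sift it up to restore the min-heap property. */
static void heap_push(ull *heap, size_t *size, ull value) {
    size_t i = (*size)++;
    heap[i] = value;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        ull tmp = heap[i];
        heap[i] = heap[(i - 1) / 2];
        heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* Remove and return the smallest element (the heap must not be empty). */
static ull heap_pop(ull *heap, size_t *size) {
    ull top = heap[0];
    size_t i = 0;
    heap[0] = heap[--(*size)];
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < *size && heap[left] < heap[smallest]) smallest = left;
        if (right < *size && heap[right] < heap[smallest]) smallest = right;
        if (smallest == i) break;
        ull tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
    return top;
}

/* Fill out[0..limit-1] with the smallest numbers of the form 2^a * 3^b.
   limit must not exceed MAX_RESULTS. */
void first_n_2a3b(ull *out, size_t limit) {
    ull heap[2 * MAX_RESULTS + 1];   /* one initial push plus two per result */
    size_t size = 0, count = 0;
    heap_push(heap, &size, 1);
    while (count < limit) {
        ull x = heap_pop(heap, &size);
        if (count > 0 && out[count - 1] == x)
            continue;                /* duplicate of the previous result */
        out[count++] = x;
        heap_push(heap, &size, 2 * x);
        heap_push(heap, &size, 3 * x);
    }
}
```

For limit = 100 the values stay far below the range of unsigned long long, so overflow is not a concern here; for much larger limits that would need checking.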
Brute force in Python (I know that C code is required):
sorted(2**a*3**b for a in range(100) for b in range(100))[:100]
And the result is …
I want to write a program that computes the weighted average (formula: sum of mark * corresponding credits, divided by total credits) over the best (highest-scoring) 120 credits' worth of modules, given the following marks and credits (each credits value corresponds to the module in the same position):
module[12]={48, 77, 46, 82, 85, 43, 49, 73, 65, 48, 47, 51}
credits[12]={60, 20, 20, 20, 10, 20, 10, 10, 10, 20, 20, 10}
What I have done so far is bubble sort the arrays in decreasing order of mark, so that I know which marks are higher, as shown below:
module[12]={85, 82, 77, 73, 65, 51, 49, 48, 48, 47, 46, 43}
credits[12]={10, 20, 20, 10, 10, 10, 10, 60, 20, 20, 20, 20}
Then I need to choose the best 120 credits' worth of modules from the sorted array so that the weighted average is maximal, but I have no idea where to start. =(
Can someone help me? Thanks a lot!
EDIT:
I have tried to work out the code myself and eventually arrived at the following. It works, but in some special cases it stops working. =(
float credits=0, result=0;
n=0;
struct{
float credits;
float result;
float n;
float addpoint;
}point;
while (credits < 120){
credits+=credits[n];
result+=(result[n]*credits[n]);
n++;
}
if (credits != 120){
credits -= credits[n-1];
result -= (result[n-1]*credits[n-1]);
point.credits = credits;
point.result = result;
point.n = (n-1)-1;
point.addpoint = n;
again: while (credits < 120){
credits+=credits[n];
result+=(result[n]*credits[n]);
n++;
}
if (credits != 120){
point.credits -= credits[point.n-1];
point.result -= result[point.n-1]*credits[point.n-1];
point.n--;
credits = point.credits;
result = point.result;
n = point.addpoint-1;
goto again;
}
}
EDIT:
Solved. Using knapsack problem code/integer linear programming by the application of glpk
The other way, which is practical for small examples like the one you've got, is to use dynamic programming. One can build up a table of "optimal score if you use the first k subjects, and want credits to add up to T". The code's a bit messy because C doesn't make it particularly easy to have dynamically sized 2d arrays, but here's a solution. Probably your professor was expecting something along these lines.
A small note, I divided all the credits (and the target credits of 120) by 10 because the common factor was redundant, but the code works just fine without that (it'll just use a tad more memory and time).
#include <stdio.h>
#include <stdlib.h>
int max(int a, int b) {
return a > b ? a : b;
}
#define GET(t, i, j, n) ((t)[(i) * ((n) + 1) + (j)])
// optimize_marks takes arrays creds and marks (both of length n),
// and finds a subset I of 0..(n-1) that maximizes
// sum(i in I)creds[i]*marks[i], such that sum(i in I)creds[i] = total.
void optimize_marks(size_t n, int *creds, int *marks, int total) {
// tbl[k * (total + 1) + T] stores the optimal score using only the
// first k subjects for a total credit score of T.
// tbl[n * (total + 1) + total] will be the final result.
// A score of -1 means that the result is impossible.
int *tbl = malloc((n + 1) * (total + 1) * sizeof(int));
for (int i = 0; i <= n; i++) {
for (int T = 0; T <= total; T++) {
if (i == 0) {
// With 0 subjects, the best score is 0 if 0 credits are
// required. If more than 0 credits are required, the result
// is impossible.
GET(tbl, i, T, total) = -(T > 0);
continue;
}
// One way to get T credits with the first i subjects is to
// get T credits with the first (i-1) subjects.
GET(tbl, i, T, total) = GET(tbl, i - 1, T, total);
// The other way is to use the marks for the i'th subject
// and get the rest of the credits with the first (i-1) subjects.
// We have to check that it's possible to use the first (i-1) subjects
// to get the remainder of the credits.
if (T >= creds[i-1] && GET(tbl, i - 1, T - creds[i-1], total) >= 0) {
// Pick the best of using and not using the i'th subject.
GET(tbl, i, T, total) = max(
GET(tbl, i, T, total),
GET(tbl, i - 1, T - creds[i-1], total) + marks[i-1] * creds[i-1]);
}
}
}
int T = total;
for (int i = n; i > 0; i--) {
if (GET(tbl, i - 1, T, total) < GET(tbl, i, T, total)) {
printf("%d %d %d\n", i, creds[i-1], marks[i-1]);
T -= creds[i-1];
}
}
free(tbl);
}
int main(int argc, char *argv[]) {
int creds[] = {6, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1};
int marks[] = {48, 77, 46, 82, 85, 43, 49, 73, 65, 48, 47, 51};
optimize_marks(12, creds, marks, 12);
return 0;
}
The program gives the same solution as the ILP program:
12 1 51
11 2 47
10 2 48
9 1 65
8 1 73
5 1 85
4 2 82
2 2 77
This is an integer linear programming problem. You want to find a vector [x1 x2 x3 ... x12] where each of the x's is either 0 or 1, such that sum(x[i] * cred[i]) = 120 and that sum(x[i] * cred[i] * marks[i]) is maximised.
There's a large body of research on how to solve such problems, but there are existing solvers out there, ready for you to use, and using them is going to save you a vast amount of time over coding a solver up yourself.
Here's a model file for glpk, a free linear programming and integer linear programming solver. You can run it by installing glpk, saving this to a file, then running glpsol -m <filename>
set I := 1..12;
var x{I} binary;
param cred{I};
param marks{I};
maximize s:
sum{i in I}(x[i] * cred[i] * marks[i]);
s.t. totalcreds: sum{i in I}(x[i] * cred[i]) = 120;
solve;
printf {i in I} "%d: %d %d %d\n", i, x[i], cred[i], marks[i];
data;
param cred := 1 60, 2 20, 3 20, 4 20, 5 10, 6 20, 7 10, 8 10, 9 10, 10 20, 11 20, 12 10;
param marks := 1 48, 2 77, 3 46, 4 82, 5 85, 6 43, 7 49, 8 73, 9 65, 10 48, 11 47, 12 51;
end;
The output is this:
1: 0 60 48
2: 1 20 77
3: 0 20 46
4: 1 20 82
5: 1 10 85
6: 0 20 43
7: 0 10 49
8: 1 10 73
9: 1 10 65
10: 1 20 48
11: 1 20 47
12: 1 10 51
That is, you should pick courses 2, 4, 5, 8, 9, 10, 11, 12 (ie: the courses with a '1' by them).
You should probably start by isolating all the combinations of credits[] elements whose sum equals 120 and storing them before calculating all the means.
It would look like this:
// test all possible combinations:
// [0] + [1] + [2]..., [1] + [0] + [2]...
// starting point on the array : n
while(n < 12){
// if the sum isn't equal to 120 and
// we haven't covered the whole array yet
if(temp < 120 && i < 12){
// if i is different from the starting point
if(i < 11 && i != n){
i++;
temp += credits[i];
// store i
}
}
if(temp >120 && i < 11){
temp -= credits[i];
// remove i
i++;
}
if(temp == 120){
// starting point ++
n++;
}
}
It's not optimal, but it could work.
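With only 12 modules, one way to actually visit every combination is to treat each subset as a bitmask (2^12 = 4096 subsets). The following is a sketch of that idea rather than a completion of the loop above; best_weighted_sum and its parameters are names chosen for illustration.

```c
#include <stddef.h>

/* Try every subset of the n modules (n is assumed small, e.g. n <= 20) and
   return the largest sum of mark * credit over the subsets whose credits add
   up to exactly target. Returns -1 if no subset reaches the target.
   The winning subset is stored in *best_mask, one bit per module. */
long best_weighted_sum(const int *marks, const int *credits, size_t n,
                       int target, unsigned long *best_mask) {
    long best = -1;
    for (unsigned long mask = 0; mask < (1UL << n); mask++) {
        int total = 0;
        long weighted = 0;
        for (size_t i = 0; i < n; i++) {
            if (mask & (1UL << i)) {
                total += credits[i];
                weighted += (long)marks[i] * credits[i];
            }
        }
        if (total == target && weighted > best) {
            best = weighted;
            *best_mask = mask;
        }
    }
    return best;
}
```

Dividing the returned sum by the target gives the weighted average. For the marks and credits in the question with a target of 120, the returned sum is 7820, i.e. an average of about 65.17.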
I have solved a question that says:
Given a natural number n (1 <= n <= 500000), please output the summation of all its proper divisors.
Definition: A proper divisor of a natural number is the divisor that is strictly less than the number.
e.g. number 20 has 5 proper divisors: 1, 2, 4, 5, 10, and the divisor summation is: 1 + 2 + 4 + 5 + 10 = 22.
Input
An integer stating the number of test cases (equal to about 200000), and that many lines follow, each containing one integer between 1 and 500000 inclusive.
Output
One integer each line: the divisor summation of the integer given respectively.
Example
Sample Input:
3
2
10
20
Sample Output:
1
8
22
My code is as follows:
/* #BEGIN_OF_SOURCE_CODE */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, const char * argv[])
{
int sum = 0,
cases = 0,
i, j, buff;
scanf("%d", &cases); //Number of tests
int *n;
n = (int*) malloc(cases * sizeof(int)); //Defining array for numbers to be tested///////
for (i = 0; i < cases; i++) {
scanf("%d", &n[i]);
}
for (i = 0; i < cases; i++ ) {
buff = n[i] / 2;
if (n[i] == 1) {
sum = -1;
}
if (!(n[i] & 1)) {
for (j = 2; j < buff; j++) {
if (n[i] % j == 0) {
sum += n[i] / j + j;
buff /= j;
}
}
}
else {
for (j = 3; j < buff; j += 2) {
if (n[i] % j == 0) {
if (n[i] / j == j) { sum += j; break; }
else sum += n[i] / j + j;
}
buff /= j;
}
}
printf("%d\n", ++sum);
sum = 0;
}
return 0;
}
/* #END_OF_SOURCE_CODE */
but it is not fast enough. Any suggestions?
I have updated the code below to terminate sooner. Running it for all integers from 1 to 500,000 takes under half a second on a MacBookPro6,1 (2.66 GHz Intel Core i7), compiled with Apple GCC 4.2.1 with -O3.
It uses the formula for σx(n) in the Properties section of the Wikipedia page for the divisor function. It could be made faster with a list of precalculated primes. (126 are needed to support inputs up to 500,000, and this reduces the time to less than a quarter of a second.) There are also some divisions that can be eliminated, at the expense of cluttering the code slightly.
// Return the least power of a that does not divide x.
static unsigned int LeastPower(unsigned int a, unsigned int x)
{
unsigned int b = a;
while (x % b == 0)
b *= a;
return b;
}
// Return the sum of the proper divisors of x.
static unsigned int SumDivisors(unsigned int x)
{
unsigned int t = x;
unsigned int result = 1;
// Handle two specially.
{
unsigned int p = LeastPower(2, t);
result *= p-1;
t /= p/2;
}
// Handle odd factors.
for (unsigned int i = 3; i*i <= t; i += 2)
{
unsigned int p = LeastPower(i, t);
result *= (p-1) / (i-1);
t /= p/i;
}
// At this point, t must be one or prime.
if (1 < t)
result *= 1+t;
return result - x;
}
You don't have to allocate space. Just process the input line by line.
For each line, there is an O(sqrt(n)) algorithm.
#include <iostream>
using std::cout; using std::endl; using std::cin;
int main() {
int count, number;
cin >> count;
for (int i = 0; i < count; ++i) {
cin >> number;
int sum = (number > 1) ? 1 : 0; // 1 is a proper divisor of every number except 1 itself
for ( int j = 2; j * j <= number; ++j ) {
if ( number % j == 0 ) {
sum += j;
if ( j != number / j ) sum += number / j; // add the paired divisor unless j is the square root
}
}
cout << sum << '\n';
}
}
This is the runtime for 200,000 test cases:
real 0m55.420s
user 0m0.016s
sys 0m16.124s
I would start by NOT storing the numbers in an array at all. You don't need to - just read the value, process it, and output the result. The compiler may well not realize that n[i] is the same value throughout the loop, and that nothing else modifies it.
The logic doesn't seem very clear to me. And if (n[i] == 1) { sum = 1} else ... would make more sense than setting sum = -1.
You could perhaps also keep a list of "common factors" (see memoization, http://en.wikipedia.org/wiki/Memoization), so that you don't have to recalculate the same thing many times over. (If you know that something has the factor 24, then it also has 2, 3, 4, 6, 8 and 12, for example.)
I replied to a similar question on stackoverflow
There is a faster algorithm based on a formula for the sum of divisors that uses the decomposition into prime factors.
First you construct a prime table such that the last prime squared is smaller than the upper bound for your number. Then you apply the formula to each entry. If a number is written as
n = a1^p1 * a2^p2 * ... * ak^pk
the complexity of finding the sum for a given number n will be
p1 + p2 + ... + pk = roughly log(n)
which is better than the O(sqrt(n)) complexity of the first optimization, which stops the loop early.
Let's suppose you have a way to compute primes relatively quickly. This could be a one time upfront activity, bounded by the square root of the largest input value. In this case, you already know the bound of the largest input value (500000), so you can simply hard code a table of primes into the program.
static unsigned P[] = {
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233,
239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317,
331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409, 419,
421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599, 601, 607,
613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701
};
static int P_COUNT = sizeof(P)/sizeof(*P);
Now, from the primes, for each input value, you can:
1. Compute the prime factorization.
2. Compute the product of the sums of the powers of each prime factor.
This will result in the sum of the divisors. Subtract the input value from the sum to obtain the sum of proper divisors. These two steps can be combined into a single loop.
This algorithm works because multiplying polynomials naturally results in sums of all combinations of the polynomial terms multiplied together. In the case where each polynomial term consists of powers of primes that divide the input, the combinations of the terms multiplied together make up the divisors. The algorithm is fast, and should be able to process 500000 numbers in the interval [1, 500000] in less than a second on a Core i3 or better processor.
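As a small worked example of that expansion, take 20 from the problem statement, with 20 = 2^2 * 5:

```latex
\sigma(20) = (1 + 2 + 2^2)(1 + 5) = 1 + 5 + 2 + 10 + 4 + 20 = 42,
\qquad 42 - 20 = 22.
```

Every term of the expanded product is a distinct divisor of 20, and subtracting 20 itself leaves 22, the sum of proper divisors given in the question.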
The following function implements the method described above.
unsigned compute (unsigned n) {
unsigned sum = 1;
unsigned x = n;
for (int i = 0; i < P_COUNT; ++i) {
if (P[i] > x / P[i]) break; /* remaining primes won't divide x */
if (x % P[i] == 0) { /* P[i] is a divisor of n */
unsigned sub = P[i] + 1; /* add in power of P[i] */
x /= P[i]; /* reduce x by P[i] */
while (x % P[i] == 0) { /* while P[i] still divides x */
x /= P[i]; /* reduce x */
sub = sub * P[i] + 1; /* add by another power of P[i] */
}
sum *= sub; /* product of sums */
}
}
if (x > 1) sum *= x + 1; /* if x > 1, then x is prime */
return sum - n;
}
The code below precomputes all answers in O(n * log(n)) time. After that, you can output each required answer in constant time.
int ans[500000 + 10], m = 500000;
void f(void){
for(int i = 1; i <= m; i++){
for(int j = i + i; j <= m; j += i){
ans[j] += i;
}
}
}
Here ans is an array that contains the sum of proper divisors for every number from 2 to m.
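A slightly more self-contained sketch of the same sieve, with a lookup function for the queries. LIMIT, proper_sum, build_proper_sums and proper_divisor_sum are illustrative names, and build_proper_sums() is assumed to be called exactly once before any lookup:

```c
#define LIMIT 500000

/* proper_sum[k] will hold the sum of the proper divisors of k. */
static int proper_sum[LIMIT + 1];

/* Add each i to all of its proper multiples 2i, 3i, ...
   The inner loops run LIMIT/1 + LIMIT/2 + ... times, i.e. O(LIMIT log LIMIT). */
void build_proper_sums(void) {
    for (int i = 1; i <= LIMIT; i++)
        for (int j = 2 * i; j <= LIMIT; j += i)
            proper_sum[j] += i;
}

/* Constant-time answer for 1 <= n <= LIMIT after the table has been built. */
int proper_divisor_sum(int n) {
    return proper_sum[n];
}
```

In a full program, main would call build_proper_sums() once, then read each test case with scanf and print proper_divisor_sum(n).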
I recently managed to build and run a simple CLAPACK Microsoft Visual Studio 2008 project (downloaded from http://icl.cs.utk.edu/lapack-for-windows/lapack/index.html). After that, inserting a single line after LAPACK dgesv_ call to initialize another integer tempInteger leads to the unsuccessful build. The error is: CLAPACK-EXAMPLE.c(30) : error C2143: syntax error : missing ';' before 'type'. It appears that execution of LAPACK function prevents certain actions such as variable initialization afterwards. Could anyone help me understand what's going on and fix it? Thanks in advance. The code listing is below:
#include <stdio.h>
#include "f2c.h"
#include "clapack.h"
int main(void)
{
/* 3x3 matrix A
* 76 25 11
* 27 89 51
* 18 60 32
*/
double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
double b[3] = {10, 7, 43};
int N = 3;
int nrhs = 1;
int lda = 3;
int ipiv[3];
int ldb = 3;
int info;
int qqq = 1;
dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
if(info == 0) /* succeed */
printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
else
fprintf(stderr, "dgesv_ fails %d\n", info);
int tempInteger = 1;
return info;
}
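If this file is compiled as a C file and not a C++ file, then tempInteger must be declared at the top of the function. Visual Studio 2008 compiles .c files as C89, where all declarations in a block have to come before the first statement.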
For example:
#include <stdio.h>
#include "f2c.h"
#include "clapack.h"
int main(void)
{
/* 3x3 matrix A
* 76 25 11
* 27 89 51
* 18 60 32
*/
double A[9] = {76, 27, 18, 25, 89, 60, 11, 51, 32};
double b[3] = {10, 7, 43};
int N = 3;
int nrhs = 1;
int lda = 3;
int ipiv[3];
int ldb = 3;
int info;
int qqq = 1;
int tempInteger;
dgesv_(&N, &nrhs, A, &lda, ipiv, b, &ldb, &info);
if(info == 0) /* succeed */
printf("The solution is %lf %lf %lf\n", b[0], b[1], b[2]);
else
fprintf(stderr, "dgesv_ fails %d\n", info);
tempInteger = 1;
return info;
}
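If you would rather keep a declaration close to where it is used, C89 also allows declarations at the start of any nested block. A minimal sketch, using a made-up function sum_then_scale that has nothing to do with CLAPACK itself:

```c
/* C89 style: every declaration sits at the start of a block. */
int sum_then_scale(int a, int b) {
    int sum;                    /* declared before any statement */
    sum = a + b;
    {                           /* a nested block may open with new declarations */
        int scaled = sum * 2;
        return scaled;
    }
}
```

In the example above, the same trick would mean wrapping int tempInteger = 1; and the code that uses it in its own { ... } block after the dgesv_ call.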
I am implementing the Sieve of Eratosthenes in CUDA and am getting very weird output. I am using unsigned char* as the data structure and the following macros to manipulate the bits.
#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;
I set the bit to denote that it's a prime number; otherwise it's 0.
Here is where I call my kernel:
size_t p=3;
size_t primeTill = 30;
while(p*p<=primeTill)
{
if(ISBITSET(h_a, p) == 1){
int dimA = 30;
int numBlocks = 1;
int numThreadsPerBlock = dimA;
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
cudaThreadSynchronize();
reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
cudaThreadSynchronize();
cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
cudaThreadSynchronize();
printf("This is after removing multiples of %d\n", p);
//Loop
for(size_t i = 0; i < primeTill +1; i++)
{
printf("Bit %d is %d\n", i, ISBITSET(h_a, i));
}
}
p++;
}
Here is my kernel
__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
int id = blockIdx.x*blockDim.x + threadIdx.x;
int r = id*p;
if(id >= p && r <= size )
{
while(ISBITSET(d_out, r ) == 1 ){
CLEARBIT(d_out, r);
}
// if(r == 9)
// {
// /* code */
// CLEARBIT(d_out, 9);
// }
}
}
The output should be:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29
while my output is:
2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29
If you take a look at the kernel code, if I uncomment those lines I get the correct answer, which means that there is nothing wrong with my loops or my checking!
Multiple threads are accessing the same word (char) in global memory simultaneously and thus the written result gets corrupted.
You could use atomic operations to prevent this but the better solution would be to alter your algorithm: Instead of letting every thread sieve out multiples of 2, 3, 4, 5, ... let every thread check a range like [0..7], [8..15], ... so that every range's length is a multiple of 8 bits and no collisions occur.
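To illustrate the second suggestion, here is a sketch in plain C of the work one thread would do if it owned a single byte of the bit array. sieve_one_byte and byte_index are hypothetical names; in a kernel, byte_index would presumably be derived from the thread and block indices, which is an assumption on my part.

```c
#include <stddef.h>

/* Clear the bits of all multiples of p (starting at p*p) that fall inside the
   single byte bits[byte_index], i.e. the numbers 8*byte_index .. 8*byte_index+7.
   Each call writes to exactly one byte, so calls with different byte_index
   values never touch the same memory location. */
void sieve_one_byte(unsigned char *bits, size_t byte_index, size_t p) {
    size_t lo = 8 * byte_index;
    size_t hi = lo + 7;
    size_t first = p * p;
    if (first < lo)
        first = ((lo + p - 1) / p) * p;   /* first multiple of p that is >= lo */
    for (size_t m = first; m <= hi; m += p)
        bits[m >> 3] &= (unsigned char)~(1u << (m & 7));
}
```

The trade-off is that every thread loops over its own small range for the current p, instead of each thread clearing one multiple, but no two threads ever read-modify-write the same char.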
I would suggest replacing the macros with functions to start with. You can use functions qualified with __host__ and __device__ to generate host (cpp) and device (cu) versions where necessary. That will eliminate the possibility of the preprocessor doing something unexpected.
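A possible shape for those functions, written here as plain C inline functions. The names is_bit_set, set_bit and clear_bit are my own; in a .cu file you would additionally qualify them with __host__ __device__ so they can be called from both sides.

```c
#include <stddef.h>

/* Return 1 if bit i of the packed array x is set, 0 otherwise. */
static inline int is_bit_set(const unsigned char *x, size_t i) {
    return (x[i >> 3] >> (i & 7)) & 1;
}

/* Set bit i of the packed array x. */
static inline void set_bit(unsigned char *x, size_t i) {
    x[i >> 3] |= (unsigned char)(1u << (i & 7));
}

/* Clear bit i of the packed array x. */
static inline void clear_bit(unsigned char *x, size_t i) {
    x[i >> 3] &= (unsigned char)~(1u << (i & 7));
}
```

Unlike the macros, these evaluate their arguments exactly once and carry no stray trailing semicolon, so expressions such as is_bit_set(h_a, p + 1) behave as expected.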
Now just debug the particular code branch that is causing the wrong output, checking that each stage is correct in turn and you'll find the problem.