I read here (page 14) that one way to improve code efficiency for ARM devices is to use ARM-compiler-specific keywords like __promise, and the following sample code is given:
void f(int *x, int n)
{
    int i;
    __promise((n > 0) && ((n & 7) == 0)); /* My question: how does this line improve efficiency? */
    for(i = 0; i < n; i++)
    {
        x[i]++;
    }
}
But I don't understand how this extra information (that the loop count n is positive and divisible by 8) improves efficiency!
Could anybody please explain how?
__promise((n > 0) && ((n & 7) == 0));
The binary representation of 7 is 0b111, so (n & 7) == 0 tells you that the last 3 bits of n are always zero. Therefore n must be a multiple of 8. Furthermore, n is also greater than 0.
You thus promise the compiler that it can safely unroll your loop into blocks of 8, and that the loop will execute at least one iteration.
The compiler could therefore choose to rewrite your code as:
int i = 0;
do
{
    x[i+0]++;
    x[i+1]++;
    x[i+2]++;
    x[i+3]++;
    x[i+4]++;
    x[i+5]++;
    x[i+6]++;
    x[i+7]++;
} while ((i+=8) != n);
This skips a large number of comparisons and branches.
NEON's VADD instruction lets you optimize this even further: vaddq_s32 adds the vector (1,1,1,1) to four elements of your array at once, so the compiler could replace that block with two vector adds (if it decides you would benefit from that).
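For illustration, here is a hedged sketch of what such a vectorized version could look like, written by hand with NEON intrinsics from <arm_neon.h> (the function name f_neon is mine; it relies on exactly what __promise asserted, i.e. n > 0 and n a multiple of 8):

#include <arm_neon.h>

/* Sketch only: increment eight ints per iteration, four at a time. */
void f_neon(int32_t *x, int n)
{
    int32x4_t one = vdupq_n_s32(1);               /* the vector (1,1,1,1) */
    for (int i = 0; i < n; i += 8) {
        int32x4_t a = vld1q_s32(&x[i]);           /* load x[i..i+3]   */
        int32x4_t b = vld1q_s32(&x[i + 4]);       /* load x[i+4..i+7] */
        vst1q_s32(&x[i],     vaddq_s32(a, one));  /* store incremented values */
        vst1q_s32(&x[i + 4], vaddq_s32(b, one));
    }
}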
I'm supposed to create a function that finds the closest next prime for a given number. Even though my algorithm is badly written and very slow (probably the slowest around), it does the task. The problem is that the program that evaluates my submission rejects it with a timeout error: it is given a bunch of numbers at once and they all have to be resolved within 10 seconds. So the question is: are there any improvements you can suggest that could speed up my poor tortoise? (for is not allowed.)
int is_prime(int nb)
{
    int i;

    /* 0, 1 and negative numbers are not prime */
    if (nb <= 1)
        return (0);
    /* start from the first prime */
    i = 2;
    /* a prime is divisible only by 1 and itself */
    while (nb % i != 0)
        i++;
    /* i is the smallest divisor of nb; nb is prime only if i == nb */
    if (i == nb)
        return (1);
    else
        return (0);
}

int find_next_prime(int nb)
{
    int i;

    i = 0;
    /* keep looking for primes one by one */
    while (!is_prime(nb + i))
        i++;
    return (nb + i);
}
The best simple speed improvement is to check divisors only up to the square root of nb rather than all divisors up to nb. This takes the algorithm from O(nb) to O(sqrt(nb)).
Consider is_prime(2147483647). OP's approach takes about 2147483647 iterations. Testing up to about the square root, 46341, is about 46,000 times faster.
// while (nb % i != 0) i++;
// if (i == nb) return (1);
// else return (0);

// While the divisor <= quotient (i.e. until the square root is reached)
while (i <= nb / i) {
    if (nb % i == 0)
        return 0;   /* found a divisor: nb is not prime */
    i++;
}
return 1;           /* no divisor up to sqrt(nb): nb is prime */
Avoid an i*i <= nb test, as i*i may overflow.
Avoid sqrt(nb), as that involves a host of floating-point/int issues.
Note: good compilers see the nearby nb/i and nb%i and compute both for the time cost of one.
Lots of other improvements are possible, but I wanted to focus on a simple one with a big impact. When you want to improve speed, focus on reducing the order of complexity O() rather than on linear improvements. Is premature optimization really the root of all evil?
You're missing the two most common tricks to improve speed:
you only need to check divisors up to the square root of the number;
once you have checked 2, you only need to check odd numbers from 3 on.
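Here is a minimal sketch combining both tricks, kept in the same while-only style as the original (no for loop); treat it as an illustration rather than a drop-in graded solution:

int is_prime(int nb)
{
    int i;

    if (nb <= 1)
        return (0);
    if (nb % 2 == 0)          /* 2 is the only even prime */
        return (nb == 2);
    i = 3;
    while (i <= nb / i)       /* only test divisors up to the square root */
    {
        if (nb % i == 0)
            return (0);
        i += 2;               /* skip even divisors */
    }
    return (1);
}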
Hi: I have been ramping up on C and I have a couple of philosophical questions about arrays and pointers, and about how to make things simple, quick, and small, or at least balance the three.
I imagine an MCU sampling an input every so often and storing the sample in an array called "val" of size NUM_TAPS. The index into val gets decremented for the next sample, so for instance if val[0] was just stored, the next value goes into val[NUM_TAPS-1].
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
It is a slightly different problem from the rotating, circular, queue, etc. buffers that many have described on this and other forums. I don't need (I think) a head and a tail pointer, because I always have NUM_TAPS data values. I only need to remap the indexes based on a "head pointer".
Below is the code I came up with. It seems to be working fine but it raises a few more questions I'd like to pose to the wider, much more expert community:
1. Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS - 1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?

2. Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves, data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?

3. Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?

offset = (++offset) % N;

or

offset++;
if (NUM_TAPS == offset) { offset = 0; }
Thank you!
#include <stdio.h>

#define NUM_TAPS 10
#define STARTING_VAL 0
#define HALF_PERIOD 3

int main(void) {
    register int sample_offset = 0;
    int wrap_offset = 0;
    int val[NUM_TAPS];
    int * pval;
    int * x[NUM_TAPS];
    int live_sample = 1;

    //START WITH 0 IN EVERY LOCATION
    pval = val; /* 1st address of val[] */
    for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL; }

    //EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
    for (int loop = 0; loop < 30; loop++) {
        if (0 == loop % HALF_PERIOD && loop > 0) { live_sample *= -1; }
        *(pval + sample_offset) = live_sample; //really stupid square wave generator

        //assign pointers in 'x' based on the starting offset:
        for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval + (sample_offset + i) % NUM_TAPS; }

        //METHOD #1: dump the samples using pval:
        //for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ", *(pval + (sample_offset + i) % NUM_TAPS)); }
        //printf("\n");

        //METHOD #2: dump the samples using x:
        for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ", *x[i]); }
        printf("\n");

        sample_offset = (sample_offset - 1) % NUM_TAPS;                     //next location of the sample to be stored, relative to pval
        sample_offset = (sample_offset < 0 ? NUM_TAPS - 1 : sample_offset); //wrap around if sample_offset goes negative
    }
    return 0;
}
The cost of a % operator is about 26 clock cycles, since it is implemented using the DIV instruction. An if statement is likely faster: the instructions are already in the pipeline, and skipping a few of them when the branch is predicted correctly is cheap.
Note that both solutions are slow compared to a BITWISE AND operation, which takes only 1 clock cycle. For reference, if you want the gory details, check out this chart of instruction costs (measured in CPU clock ticks):
http://www.agner.org/optimize/instruction_tables.pdf
The best way to do a fast modulo on a buffer index is to use a power-of-2 value for the buffer size, so that you can use the quick BITWISE AND operator instead.
#define NUM_TAPS 16
With a power-of-2 value for the buffer size, you can use a bitwise AND to implement modulo very efficiently. Recall that ANDing a bit with 1 leaves it unchanged, while ANDing it with 0 clears it.
So by ANDing your incremented index with NUM_TAPS-1, assuming NUM_TAPS is 16, the index will cycle through the values 0,1,2,...,14,15,0,1,...
This works because NUM_TAPS-1 equals 15, which is 00001111b in binary: the bitwise AND preserves only the last 4 bits and zeroes any higher bits.
So everywhere you use "% NUM_TAPS", you can replace it with "& (NUM_TAPS-1)" — keeping the whole masked expression in its own parentheses, since & binds more loosely than + in C. For example:
#define NUM_TAPS 16
...
//assign pointers in 'x' based on the starting offset:
for (int i = 0; i < NUM_TAPS; i++)
    { x[i] = pval + ((sample_offset + i) & (NUM_TAPS-1)); }
Here is your code modified to work with BITWISE AND, which is the fastest solution.
#include <stdio.h>

#define NUM_TAPS 16            // Use a POWER of 2 for speed, 16=2^4
#define MOD_MASK (NUM_TAPS-1)  // Saves typing and makes code clearer
#define STARTING_VAL 0
#define HALF_PERIOD 3

int main(void) {
    register int sample_offset = 0;
    int wrap_offset = 0;
    int val[NUM_TAPS];
    int * pval;
    int * x[NUM_TAPS];
    int live_sample = 1;

    //START WITH 0 IN EVERY LOCATION
    pval = val; /* 1st address of val[] */
    for (int i = 0; i < NUM_TAPS; i++) { *(pval + i) = STARTING_VAL; }

    //EVENT LOOP (SAMPLE A SQUARE WAVE EVERY PASS)
    for (int loop = 0; loop < 30; loop++) {
        if (0 == loop % HALF_PERIOD && loop > 0) { live_sample *= -1; }
        *(pval + sample_offset) = live_sample; //really stupid square wave generator

        //assign pointers in 'x' based on the starting offset:
        for (int i = 0; i < NUM_TAPS; i++) { x[i] = pval + ((sample_offset + i) & MOD_MASK); }

        //METHOD #1: dump the samples using pval:
        //for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ", *(pval + ((sample_offset + i) & MOD_MASK))); }
        //printf("\n");

        //METHOD #2: dump the samples using x:
        for (int i = 0; i < NUM_TAPS; i++) { printf("%3d ", *x[i]); }
        printf("\n");

        // sample_offset = (sample_offset - 1) % NUM_TAPS;                     //next location of the sample, relative to pval
        // sample_offset = (sample_offset < 0 ? NUM_TAPS - 1 : sample_offset); //wrap around if sample_offset goes negative
        // MOD_MASK works faster than the above
        sample_offset = (sample_offset - 1) & MOD_MASK;
    }
    return 0;
}
At the end of the day I want to be able to refer to the newest sample as x[0] and the oldest sample as x[NUM_TAPS-1] (or equivalent).
Any way you implement this is very expensive, because each time you record a new sample, you have to move all the other samples (or pointers to them, or an equivalent). Pointers don't really help you here. In fact, using pointers as you do is probably a little more costly than just working directly with the buffer.
My suggestion would be to give up the idea of "remapping" indices persistently, and instead do it only virtually, as needed. I'd probably ease that and ensure it is done consistently by writing data access macros to use in place of direct access to the buffer. For example,
// expands to an expression designating the sample at the specified
// (virtual) index
#define SAMPLE(index) (val[((index) + sample_offset) % NUM_TAPS])
You would then use SAMPLE(n) instead of x[n] to read the samples.
I might consider also providing a macro for adding new samples, such as
// Updates sample_offset and records the given sample at the new offset
#define RECORD_SAMPLE(sample) do { \
sample_offset = (sample_offset + NUM_TAPS - 1) % NUM_TAPS; \
val[sample_offset] = sample; \
} while (0)
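For example, the original event loop could then be written along these lines (a sketch reusing the question's variables; x[] and the manual offset bookkeeping go away entirely):

for (int loop = 0; loop < 30; loop++) {
    if (0 == loop % HALF_PERIOD && loop > 0) { live_sample *= -1; }
    RECORD_SAMPLE(live_sample);          /* store the newest sample */
    for (int i = 0; i < NUM_TAPS; i++)
        printf("%3d ", SAMPLE(i));       /* SAMPLE(0) is the newest, SAMPLE(NUM_TAPS-1) the oldest */
    printf("\n");
}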
With regard to your specific questions:
Is there a better way to assign indexes than a conditional assignment (to wrap indexes < 0) with the modulus operator (to wrap indexes > NUM_TAPS - 1)? I can't think of a way that pointers to pointers would help, but does anyone else have thoughts on this?
I would choose modulus over a conditional every time. Do, however, watch out for taking the modulus of a negative number (see above for an example of how to avoid doing so); such a computation may not mean what you think it means. For example -1 % 2 == -1, because C specifies that (a/b)*b + a%b == a for any a and b such that the quotient is representable.
Instead of shifting the data itself as in a FIFO to organize the values of x, I decided here to rotate the indexes. I would guess that for data structures close to or smaller in size than the pointers themselves data moves might be the way to go, but for very large numbers (floats, etc.) perhaps the pointer assignment method is the most efficient. Thoughts?
But your implementation does not rotate the indices. Instead, it shifts pointers. Not only is this about as expensive as shifting the data themselves, but it also adds the cost of indirection for access to the data.
Additionally, you seem to have the impression that pointer representations are small compared to representations of other built-in data types. This is rarely the case. Pointers are usually among the largest of a given C implementation's built-in data types. In any event, neither shifting around the data nor shifting around pointers is efficient.
Is the modulus operator generally considered close in speed to conditional statements? For example, which is generally faster?:
On modern machines, the modulus operator is much faster on average than a conditional whose result is difficult for the CPU to predict. CPUs these days have long instruction pipelines, and they perform branch prediction and corresponding speculative computation to enable them to keep these full when a conditional instruction is encountered, but when they discover that they have predicted incorrectly, they need to flush the whole pipeline and redo several computations. When that happens, it's a lot more expensive than a small number of unconditional arithmetical operations.
I tried listing the prime numbers up to 2 billion using the Sieve of Eratosthenes. Here is what I used.
The problem I am facing is that I am not able to go beyond 10 million numbers. When I tried, it said 'Segmentation fault'. I searched the Internet to find the cause. Some sites say it is a memory-allocation limitation of the compiler itself; some say it is a hardware limitation. I am using a 64-bit processor with 4 GB of RAM installed. Please suggest a way to list them.
#include <stdio.h>
#include <stdlib.h>

#define MAX 1000000

long int mark[MAX] = {0};

int isone(){
    long int i;
    long int cnt = 0;
    for(i = 0; i < MAX; i++){
        if(mark[i] == 1)
            cnt++;
    }
    if(cnt == MAX)
        return 1;
    else
        return 0;
}

int main(int argc, char* argv[]){
    long int list[MAX];
    long int s = 2;
    long int i;
    for(i = 0; i < MAX; i++){
        list[i] = s;
        s++;
    }
    s = 2;
    printf("\n");
    while(isone() == 0){
        for(i = 0; i < MAX; i++){
            if((list[i] % s) == 0)
                mark[i] = 1;
        }
        printf(" %ld ", s);
        while(mark[++s - 2] != 0);
    }
    return 1;
}
The immediate problem is the large arrays. The local long int list[MAX] in main is allocated on the stack, which is not what you want for that amount of memory; the global long int mark[MAX] lives in static storage, but it too is better allocated dynamically once MAX grows. Try long int *list = malloc(sizeof(long int) * MAX) (and the same for mark) to allocate heap memory instead. This will get you beyond ~1 million array elements.
Remember to free the memory if you don't use it anymore. If you don't know malloc or free, read the man pages for these functions, available via man 3 malloc and man 3 free on any Linux terminal (alternatively you could just google "man malloc").
EDIT: make that calloc(MAX, sizeof(long int)) for mark so you get a 0-initialized array, which is probably better. A sketch of the changed allocations is below.
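A minimal sketch of those changed allocations (names taken from the question; if mark becomes a pointer, keep it at file scope so isone() can still see it):

#include <stdlib.h>

long int *mark = calloc(MAX, sizeof *mark);   /* zero-initialized, on the heap */
long int *list = malloc(MAX * sizeof *list);
if (mark == NULL || list == NULL) {
    fprintf(stderr, "allocation failed\n");
    exit(1);
}
/* ... sieve as before, using mark[i] and list[i] ... */
free(list);
free(mark);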
Additionally, you can use every element of your array as a bit mask, so you store one mark per bit rather than one mark per sizeof(long int) bytes. I'd recommend a fixed-width integer type like uint32_t for the array elements, and then setting the (n % 32)'th bit of the (n / 32)'th element of the array to 1 instead of setting the nth element to 1.
You can set the nth bit of an integer i using:

uint32_t i = 0;
i |= ((uint32_t) 1) << n;

assuming you start counting at 0.
That makes the set operation on the uint32_t bitmask array, for a number n:

mask[n / 32] |= ((uint32_t) 1) << (n % 32);
That saves you about 97% of the memory compared with using a whole 32-bit element per mark. Have fun :D
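A small sketch of helpers around that idea (the names set_mark/get_mark are mine, not a standard API):

#include <stdint.h>
#include <stdlib.h>

static inline void set_mark(uint32_t *mask, long n) {
    mask[n / 32] |= (uint32_t)1 << (n % 32);
}

static inline int get_mark(const uint32_t *mask, long n) {
    return (mask[n / 32] >> (n % 32)) & 1u;
}

/* The packed array then needs only (MAX + 31) / 32 elements, e.g.
   uint32_t *mask = calloc((MAX + 31) / 32, sizeof *mask);          */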
Another, more advanced approach to use here is prime wheel factorization, which basically means that you declare 2, 3 and 5 (and possibly even more) as prime beforehand, and use only numbers that are not divisible by any of these in your mask array. But that's a really advanced concept.
However, I have written a prime sieve with wheel factorization for 2 and 3 in C in about ~15 lines of code (also for Project Euler), so it is possible to implement this stuff efficiently ;)
The most immediate improvement is to switch to bits representing only the odd numbers. To cover M = 2 billion numbers, i.e. 1 billion odds, you need 1,000,000,000 bits / 8 = 125 million bytes =~ 120 MB of memory (still allocate them on the heap, with the calloc function).
The bit at position i represents the number 2*i+1. Thus when marking the multiples of a prime p, i.e. p^2, p^2+2p, ..., M, note that p^2 = (2i+1)^2 = 4i^2+4i+1 is represented by the bit at position j = (p^2-1)/2 = 2i(i+1), and the next odd multiples of p above it sit at position increments of p = 2i+1:
for( i=1; ; ++i )
    if( bit_not_set(i) )
    {
        p = i+i+1;
        k = (p-1)*(i+1);
        if( k > 1000000000 ) break;
        for( ; k < 1000000000; k += p )
            set_bit(k);     // mark as composite
    }
// all bits i>0 where bit_not_set(i) holds
// represent prime numbers 2i+1
The next step is to switch to working in smaller segments that fit in your cache; this should speed things up. You only need to reserve a memory region for the primes under the square root of 2 billion, i.e. 44721.
First, sieve this smaller region to find the primes in it; then write these primes into a separate int array; then use this array of primes to sieve each segment, possibly printing the found primes to stdout or whatever. A rough sketch of that layout is given below.
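Here is a self-contained sketch of that segmented layout (odds only, counting rather than printing the primes for brevity; the constants and index arithmetic follow the bit-position mapping above, but treat it as an illustration, not a tuned implementation):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LIMIT   2000000000UL   /* sieve primes below this */
#define SEGMENT (1UL << 18)    /* odd numbers per segment, sized to fit in cache */

int main(void) {
    /* 1. small sieve of the odds up to sqrt(LIMIT) ~ 44721; index i stands for 2*i+1 */
    unsigned long n_small = 44721 / 2 + 1;
    uint8_t *small = calloc(n_small, 1);
    unsigned long *primes = malloc(n_small * sizeof *primes);
    unsigned long n_primes = 0;
    for (unsigned long i = 1; i < n_small; i++) {
        if (small[i]) continue;
        unsigned long p = 2 * i + 1;
        primes[n_primes++] = p;
        for (unsigned long j = 2 * i * (i + 1); j < n_small; j += p)
            small[j] = 1;
    }

    /* 2. sieve the rest of the odds in cache-sized segments */
    uint8_t *seg = malloc(SEGMENT);
    unsigned long count = 1;                       /* the prime 2 */
    for (unsigned long low = 1; 2 * low + 1 < LIMIT; low += SEGMENT) {
        unsigned long high = low + SEGMENT;        /* exclusive, in index space */
        memset(seg, 0, SEGMENT);
        for (unsigned long k = 0; k < n_primes; k++) {
            unsigned long p = primes[k];
            unsigned long i = (p - 1) / 2;
            unsigned long start = 2 * i * (i + 1); /* index of p*p */
            if (start < low)                       /* jump to the first multiple >= low */
                start += ((low - start + p - 1) / p) * p;
            for (unsigned long j = start; j < high; j += p)
                seg[j - low] = 1;                  /* mark composite */
        }
        for (unsigned long j = low; j < high && 2 * j + 1 < LIMIT; j++)
            if (!seg[j - low]) count++;            /* 2*j+1 is prime */
    }
    printf("primes below %lu: %lu\n", LIMIT, count);
    free(seg); free(primes); free(small);
    return 0;
}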
There are 2 very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements in the second series. This can be traversed and processed with 2 nested loops, but the unpredictability of the number of matching elements for each member of the first series makes things very, very slow.
The actual processing of the 2nd series of elements involves a logical AND (&) and a population count.
I couldn't find good optimizations using C, but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then doing batch processing of the matching bytes of the second series, perhaps in buffers of 1MB or something. But the code would get quite messy.
Does anybody know of a better way? C preferred, but x86 ASM is OK too. Many thanks!
Sample/demo code with a simplified problem; the first series are "people" and the second series are "events", for clarity's sake. (The original problem is actually 100m and 10,000m entries!)
#include <stdio.h>
#include <stdint.h>

#define PEOPLE 1000000 // 1m
struct Person {
    uint8_t age; // Filtering condition
    uint8_t cnt; // Number of events for this person in E
} P[PEOPLE];     // Each has 0 or more bytes with bit flags

#define EVENTS 100000000 // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags

void init_arrays() {
    for (int i = 0; i < PEOPLE; i++) { // just some stuff
        P[i].age = i & 0x07;
        P[i].cnt = i % 220; // assert( sum < EVENTS );
    }
    for (int i = 0; i < EVENTS; i++) {
        P1[i] = i % 7; // just some stuff
        P2[i] = i % 9; // just some other stuff
    }
}

int main(int argc, char *argv[])
{
    uint64_t sum = 0, fcur = 0;
    int age_filter = 7; // just some
    init_arrays();      // Init P, P1, P2
    for (int64_t p = 0; p < PEOPLE; p++)
        if (P[p].age < age_filter)
            for (int64_t e = 0; e < P[p].cnt; e++, fcur++)
                sum += __builtin_popcount( P1[fcur] & P2[fcur] );
        else
            fcur += P[p].cnt; // skip this person's events
    printf("(dummy %llu %llu)\n", (unsigned long long)sum, (unsigned long long)fcur);
    return 0;
}
gcc -O5 -march=native -std=c99 test.c -o test
Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:
uint8_t *p1 = P1, *p2 = P2;

for (int64_t p = 0; p < PEOPLE; p++) {
    if (P[p].age < age_filter) {
        int64_t e = P[p].cnt;
        for ( ; e >= 8; e -= 8) {
            sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
            p1 += 8;
            p2 += 8;
        }
        for ( ; e; e--) {
            sum += __builtin_popcount( *p1++ & *p2++ );
        }
    } else {
        p1 += P[p].cnt;
        p2 += P[p].cnt;
    }
}
In my testing this speeds up your code from 1.515s to 0.855s.
The answer by Neil doesn't require sorting by age, which, by the way, could be a good idea --
If the second loop has holes (please correct the original source code to support that idea), a common solution is to build a prefix sum: cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person:
sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
Anyway it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can still take place by calling the inner loop for all the consecutive people meeting the condition.
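A rough sketch of that idea against the question's arrays (cumsum is an assumed extra array of EVENTS + 1 running totals, so the table itself costs 8 bytes per event, roughly 800 MB here; it assumes the events really are laid out consecutively with no holes and pays off mainly when reused across many queries):

#include <stdlib.h>

uint64_t *cumsum = calloc(EVENTS + 1, sizeof *cumsum);
for (int64_t e = 0; e < EVENTS; e++)
    cumsum[e + 1] = cumsum[e] + __builtin_popcount(P1[e] & P2[e]);

uint64_t sum = 0, fcur = 0;
for (int64_t p = 0; p < PEOPLE; p++) {
    if (P[p].age < age_filter)
        sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
    fcur += P[p].cnt;   /* advance past this person's events either way */
}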
If there are really at most 8 predicates, it could make sense to precalculate all the sums (popcounts of predicate[0..255]) for each person into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).
Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely just some SSE parallelism, 8 or 16 people at a time, will do.
A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. First, if the event tables are not very random and do not consist of pathological functions, such as the truth tables of bignum multiplication, one may achieve compression of the functions; secondly, arithmetic operations on truth tables can be calculated in compressed form. Each subtree can be shared between users, and each arithmetic operation for two identical subtrees has to be calculated only once.
I don't know if your sample code accurately reflects your problem but it can be rewritten like this:
for (int64_t p = 0; p < PEOPLE; p++)
    if (P[p].age < age_filter)
        fcur += P[p].cnt;

for (int64_t e = 0; e < fcur; e++)
    sum += __builtin_popcount( P1[e] & P2[e] );
I don't know about gcc -O5 (it seems not to be documented here), and it appears to produce exactly the same code as gcc -O3 with my gcc 4.5.4 (though only tested on a relatively small code sample).
Depending on what you want to achieve, -O3 can be slower than -O2.
As with your problem, I'd suggest thinking more about your data structures than the actual algorithm.
You should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't represented in a convenient manner.
If you want to quickly cut a large set of your data based on a single criterion (age, in your example), I'd recommend using a variant of a sorted tree.
If your actual data (age, count, etc.) really is 8-bit, there is probably a lot of redundancy in the calculations. In that case you can replace the processing with lookup tables: for each 8-bit value there are only 256 possible outputs, so instead of computing the result it might be possible to read the precomputed data from a table.
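For instance, a minimal sketch of that lookup-table idea applied to the 8-bit popcount (the table name popcnt8 is illustrative; on CPUs with a hardware popcount the builtin may well be faster, so measure):

#include <stdint.h>

static uint8_t popcnt8[256];

static void init_popcnt8(void) {
    for (int v = 0; v < 256; v++) {
        int c = 0;
        for (int b = v; b; b >>= 1)
            c += b & 1;          /* count the set bits of v */
        popcnt8[v] = (uint8_t)c;
    }
}

/* the inner loop then becomes: sum += popcnt8[ P1[fcur] & P2[fcur] ]; */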
To tackle the branch mispredictions (missing in other answers) the code could do something like:
#ifdef MISPREDICTIONS
    if (cond)
        sum += value;
#else
    mask = -(cond != 0);    // cond false: mask = 0 (binary 00..); cond true: mask = -1 (binary 11..)
    sum += (value & mask);  // if mask is all 0s this adds 0, if mask is all 1s it adds value
#endif
It's not completely free since there are data dependencies (think superscalar cpu). But it usually gets a 10x boost for mostly unpredictable conditions.
I'm working on Project Euler #14 in C and have figured out the basic algorithm; however, it runs insufferably slowly for large inputs, e.g. the required 2,000,000. I presume this is because it has to generate each sequence over and over again, even though there should be a way to store known sequences (e.g., once we get to 16, we know from previous experience that the next numbers are 8, 4, 2, then 1).
I'm not exactly sure how to do this with C's fixed-length arrays, but there must be a good way (that's amazingly efficient, I'm sure). Thanks in advance.
Here's what I currently have, if it helps.
#include <stdio.h>

#define UPTO 2000000

int collatzlen(int n);

int main(){
    int i, l=-1, li=-1, c=0;
    for(i=1; i<=UPTO; i++){
        if( (c=collatzlen(i)) > l) l=c, li=i;
    }
    printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
    return 1;
}

/* n != 0 */
int collatzlen(int n){
    int len = 0;
    while(n>1) n = (n%2==0 ? n/2 : 3*n+1), len+=1;
    return len;
}
Your original program needs 3.5 seconds on my machine. Is it insufferably slow for you?
My dirty and ugly version needs 0.3 seconds. It uses a global array to store the values already calculated, and uses them in future calculations.
int collatzlen2(unsigned long n);

static unsigned long array[2000000 + 1]; // to store those already calculated

int main()
{
    int i, l=-1, li=-1, c=0;
    int x;
    for(x = 0; x < 2000000 + 1; x++) {
        array[x] = -1; // use -1 to denote not-calculated yet
    }
    for(i=1; i<=UPTO; i++){
        if( (c=collatzlen2(i)) > l) l=c, li=i;
    }
    printf("Greatest length:\t\t%7d\nGreatest starting point:\t%7d\n", l, li);
    return 1;
}

int collatzlen2(unsigned long n){
    unsigned long len = 0;
    unsigned long m = n;
    while(n > 1){
        if(n > 2000000 || array[n] == -1){ // outside range or not calculated yet
            n = (n%2 == 0 ? n/2 : 3*n+1);
            len += 1;
        }
        else{ // if already calculated, use the value
            len += array[n];
            n = 1; // to get out of the while-loop
        }
    }
    array[m] = len;
    return len;
}
Given that this is essentially a throw-away program (i.e. once you've run it and got the answer, you're not going to be supporting it for years :), I would suggest having a global variable to hold the lengths of sequences already calculated:
int lengthfrom[UPTO] = {0};
If your maximum size is a few million, then we're talking megabytes of memory, which should easily fit in RAM at once.
The above will initialise the array to zeros at startup. In your program - for each iteration, check whether the array contains zero. If it does - you'll have to keep going with the computation. If not - then you know that carrying on would go on for that many more iterations, so just add that to the number you've done so far and you're done. And then store the new result in the array, of course.
Don't be tempted to use a local variable for an array of this size: that will try to allocate it on the stack, which won't be big enough and will likely crash.
Also - remember that with this sequence the values go up as well as down, so you'll need to cope with that in your program (probably by having the array longer than UPTO values, and using an assert() to guard against indices greater than the size of the array).
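A rough sketch of that approach (the array name lengthfrom comes from this answer; instead of asserting on out-of-range indices, this variant simply skips caching values above UPTO, and it uses unsigned long long for the intermediate values to sidestep the overflow mentioned in the next answer):

#include <stdio.h>

#define UPTO 2000000

static int lengthfrom[UPTO + 1];   /* global, zero-initialized: 0 means "not computed yet" */

int collatzlen(unsigned long long n)
{
    if (n <= UPTO && lengthfrom[n] != 0)
        return lengthfrom[n];               /* already known: reuse it */
    if (n == 1)
        return 0;
    int len = 1 + collatzlen(n % 2 == 0 ? n / 2 : 3 * n + 1);
    if (n <= UPTO)
        lengthfrom[n] = len;                /* remember for later starting points */
    return len;
}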
If I recall correctly, your problem isn't a slow algorithm: the algorithm you have now is fast enough for what PE asks you to do. The problem is overflow: you sometimes end up multiplying your number by 3 so many times that it will eventually exceed the maximum value that can be stored in a signed int. Use unsigned ints, and if that still doesn't work (but I'm pretty sure it does), use 64 bit ints (long long).
This should run very fast, but if you want to do it even faster, the other answers already addressed that.