I saw this C code that uses the Sieve of Eratosthenes to find primes, but I cannot extend it to even larger integers (for example, to 1000000000 and beyond) because of the memory consumed by allocating such a large char array.
What would be the strategies to extend the code to larger numbers? Any references are also welcome.
Thanks.
The standard improvement to apply would be to treat each i-th bit as representing the number 2*i+1, thus representing odds only, cutting the size of the array in half. This would also entail, for each new prime p, starting the marking-off from p*p and incrementing by 2*p, to skip over evens. 2 itself is a special case. See also this question with a lot of answers.
Another strategy is to switch to a segmented sieve. That way you only need about pi(sqrt(m)) ≈ 2*sqrt(m)/log(m) memory (m being your upper limit) set aside for the initial sequence of primes, with which you'd sieve a smaller fixed-size array sequentially representing segments of numbers. If you only need primes in some narrow, far-away range [m-d, m], you'd skip directly to sieving that range after all the needed primes have been gathered, as shown e.g. in this answer.
Per your specifics, to get primes up to 10^9 in value, working with one contiguous array is still possible: using a bit array for odds only, you'd need 10^9/16 bytes, i.e. about 60 MB of memory. It is easier to work by segments, though; we only need the 3402 primes below 31627 to sieve each segment array below 10^9.
Exactly because of the size of the array required, the Sieve of Eratosthenes becomes impractical at some point. Modified sieves are commonly used to find larger primes (as explained on Wikipedia).
You could use the GMP library. See Speed up bitstring/bit operations in Python? for a fast implementation of the Sieve of Eratosthenes. It should be relatively easy to translate the provided solutions to C.
Related
I am trying to parallelize a radix sort in C using POSIX threads. The twist is that the radix sort needs to work on floating-point numbers. Currently the code runs sequentially, but I have no idea how to parallelize it. Can anyone help me with this? Any help is appreciated.
Radix sorts are pretty hard to parallelize efficiently on CPUs. There are two parts in a radix sort: the creation of the histogram and the bucket filling.
To create a histogram in parallel, you can fill local histograms in each thread and then perform a (tree-based) reduction of the local histograms to build a global one. This strategy scales well as long as the histograms are small relative to the data chunks processed by each thread. An alternative way to parallelize this step is to use atomic adds to fill a shared histogram directly. This last method scales poorly when threads' write accesses conflict (which often happens with small histograms and many threads). Note that in both solutions, the input array is evenly distributed between threads.
Regarding the bucket-filling part, one solution is to use atomic adds to fill the buckets: one atomic counter per bucket is needed so that each thread can push items safely. This solution only scales when threads do not often access the same bucket (bucket conflicts). It is not great, since the scalability of the algorithm depends strongly on the content of the input array (sequential in the worst case). There are solutions that reduce conflicts between threads (better scalability) at the expense of more work (slower with few threads). One is to fill the buckets from both sides: threads with an even ID fill the buckets in ascending order, while threads with an odd ID fill them in descending order. Note that it is important to take false sharing into account to maximize performance.
A simple way to parallelize radix sort for all but the first pass is to use a most significant digit (MSD) pass to split up the array into bins, each of which can then be sorted concurrently. This approach relies on having somewhat uniform distribution of values, at least in terms of the most significant digit, so that the bins are reasonably equal in size.
For example, using a digit size of 8 bits (base 256), use a MSD pass to split up the array into 256 bins. Assuming there are t threads, then sort t bins at a time, using least significant digit first radix sort.
For larger arrays, it may help to use a larger initial digit size to split up the array into a larger number of bins, with the goal of getting t bins to fit in cache.
Link to a non-parallelized radix sort that uses MSD for first pass, then LSD for next 3 passes. The loop at the end of RadixSort() to sort the 256 bins could be parallelized:
Radix Sort Optimization
For the first pass, you could use the parallel method in Jerome Richard's answer, but depending on the data pattern, it may not help much, due to cache and memory conflicts.
I need to find the median of an array without sorting or copying the array.
The array is stored in the shared memory of a CUDA program. Copying it to global memory would slow the program down, and there is not enough space in shared memory to make an additional copy of it there.
I could use two for loops, iterate over every possible value, and count how many values are smaller than it, but this would be O(n^2). Not ideal.
Does anybody know of an O(n) or O(n log n) algorithm which solves my problem?
Thanks.
If your input are integers with absolute value smaller than C, there's a simple O(n log C) algorithm that needs only constant additional memory: Just binary search for the answer, i.e. find the smallest number x such that x is larger than or equal to at least k elements in the array. It's easily parallelizable too via a parallel prefix scan to do the counting.
Your time and especially memory constraints make this problem difficult. It becomes easy, however, if you're able to use an approximate median.
Say an element y is an ε-approximate median if

m/2 − εm < rank(y) < m/2 + εm

Then all you need to do is sample

t = 7ε^(−2) log(2δ^(−1))

elements, and find their median any way you want.
Note that the number of samples you need is independent of your array's size - it is just a function of ε and δ.
A little task on searching algorithms and complexity in C. I just want to make sure I'm right.
I have n natural numbers, drawn from 1 to n+1 and ordered from small to big, and I need to find the missing one.
For example: 1 2 3 5 6 7 8 9 10 11 (answer: 4).
The fastest and simplest answer is to do one loop and check every number against the one that comes after it. The complexity of that is O(n) in the worst case.
I thought maybe I'm missing something and could find it using binary search. Can anybody think of a more efficient algorithm for this simple example, like O(log(n)) or something?
There are obviously two answers:
If your problem is a purely theoretical one, especially for large n, you'd do something like a binary search: check whether the element midway between the current boundaries still has the value it would have if no number were missing, and recurse into the half that contains the gap.
However, if this is a practical question, for modern systems executing programs written in C and compiled by a modern, highly optimizing compiler, for n << 10000, I'd assume that the linear search approach is much, much faster, simply because it can be vectorized so easily. In fact, modern CPUs have instructions to, e.g.,
load 4 integers at once,
subtract four other integers,
compare the result to [4 4 4 4],
increment the counter by 4,
load the next 4 integers,
and so on. This very neatly exploits the fact that CPUs and memory controllers prefetch linear memory, whereas jumping around in logarithmically decreasing step sizes can have an enormous performance impact.
So: for large n, where linear search would be impractical, go for the binary search approach; for n where that is questionable, go for the linear search. If you not only have SIMD capabilities but also multiple cores, you will want to split your problem. If your problem is not actually exactly one missing number, you might want a completely different approach. The whole O(n) business is generally a benchmark useful mostly for theoretical analysis, and unless the difference is immensely large, it is rarely the sole reason to pick a specific algorithm in a real-world implementation.
For a comparison-based algorithm, you can't beat Lg(N) comparisons in the worst case. This is simply because the answer is a number between 1 and N and it takes Lg(N) bits of information to represent such a number. (And a comparison gives you a single bit.)
Unless the distribution of the answers is very skewed, you can't do much better than Lg(N) on average.
Now I don't see how a non-comparison-based method could exploit the fact that the sequence is ordered, and do better than O(N).
I'm struggling with the following problem:
Given n integers, place them into m bins so that the total sum over all bins is minimized. The trick is that once numbers are placed in a bin, the total weight/cost/sum of the bin is computed in a non-standard way:
weight_of_bin = Sigma - k * X

where
Sigma is the sum of the integers in the bin,
k is the number of integers in the bin, and
X is the number of prime divisors that the integers in the bin have in common.
In other words, by grouping together the numbers that have many prime divisors in common, and by placing different quantities of numbers in different bins, we can achieve some "savings" in the total sum.
I use the bin-packing formulation because I suspect the problem is NP-hard, but I have trouble finding a proof. I am not a number theory person and am confused by the fact that the weight of a bin depends on the items in it.
Are there hardness results for this type of problem?
P.S. I only know that the numbers are integers. There is no explicit limit on the largest integer involved in the problem.
Thanks for any pointers you can give.
This is not a complete answer, but hopefully it gives you some things to think about.
First, by way of clarification: what do you know about the prime divisors of the integers? Finding all the prime divisors of the integers in the input to the problem is difficult enough as it is. Factorization isn't known to be NP-complete, but it also isn't known to be in P. If you don't already know the factorization of the inputs, that might be enough to make this problem "hard".
In general, I expect this problem will be at least as hard as bin packing. A simple argument to show this is that it is possible that none of the integers given have any common prime divisors (for example, if you are given a set of distinct primes). In which case, the problem reduces to standard bin packing since the weight of the bin is just the standard weight. If you have a guarantee about how many common divisors there may be, you may possibly do better, but probably not in general.
There is a variant of bin packing, called VM packing (based on the idea of packing virtual machines based on memory requirements) where objects are allowed to share space (such as shared virtual memory pages). Your objective function, where you subtract a term based on "shared" prime divisors reminds me of that. Even in the case of VM packing, the problem is NP-hard. If the sharing has a nice hierarchy, good approximation algorithms exist, but they are still only approximations.
Just a quick question: what are people's practices for defining the (arbitrary) maximum size an array can take in C? Some people just choose a round number hoping it will be big enough, others the prime number closest to a round number (!), others some more esoteric number, and so on.
I'm wondering, then, what are some best practices for deciding such values?
Thanks.
There is no general rule. Powers of two work for buffers; I use 1024 quite often for string buffers in C, but any other number would work. Prime numbers are useful for hash tables, where simple modulo-hashing works well with prime-number sizes. Of course, you define the size as a symbolic constant so that you can change it later.
If I can't pin down a reasonable maximum, I tend to use malloc and realloc to grow the array as needed. Using a fixed-size array when you can't guarantee that it is large enough for the intended purpose is hazardous.
Best practice is to avoid arbitrary limits whenever possible.
It's not always possible, so second-best practice is to take an educated estimate of the largest thing that the array is ever likely to need to hold, and then round up by a healthy margin, at least 25%. I tend to prefer powers of ten when I do this, because it makes it obvious on inspection that the number is an arbitrary limit. (Powers of two also often signify that, but only if the reader recognizes the number as a power of two, and most readers-of-code don't have that table memorized much past 2^16. If there's a good reason to use a power of two and it needs to be bigger than that, write it in hex. End of digression.) Always document the reasoning behind your estimate of the largest thing the array needs to hold, even if it's as simple as "anyone with a single source file bigger than 2GB needs to rethink their coding style" (actual example).
Don't use a prime number unless you specifically need the properties of a prime number (e.g., as Juho mentions, for hash tables, though you only need a prime size there if your hash function isn't very good; unfortunately, it often isn't). When you do, document that you are intentionally using prime numbers and why, because most people do not recognize prime numbers on sight or know why they might be necessary in a particular situation.
If I need to do this I usually go with either a power of two, or for larger data sets, the number of pages required to hold the data. Most of the time though I prefer to allocate a chunk of memory on the heap and then realloc if the buffer size is insufficient later.
I only define a maximum when I have a strong reason for a particular number to be the maximum. Otherwise, I size it dynamically, perhaps with a sanity-check maximum (e.g. a person's name should not be several megabytes long).
Round numbers (powers of 2) are used because they are often easy for things like malloc to use (many implementations keep up with memory in blocks of various power of two sizes), easier for linkers to use (in the case of static or global arrays), and also because you can use bitwise operations to test for limits of them, which are often faster than < and >.
Prime numbers are used because prime-number-sized hash tables are supposed to avoid collisions.
Many people likely use both prime number and power of two sizes for things in cases where they don't actually provide any benefit, though.
It really isn't possible to predict at the outset what the maximum size could be.
For example, I coded a small cmdline interpreter, where each line of output produced was stored in a char array of size 200. Sufficient for all possible outputs, don't you think?
That was until I issued the env command which had a line with ~ 400 characters(!).
LS_COLORS='no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;
05;37;41:mi=01;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;
32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;
31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;35:*.bmp=01;
35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:';
Moral of the story: Try to use dynamic allocation as far as possible.