Using a linked list vs. a large array for searching - C

I am implementing some algorithmic changes to the conventional game of life for an assignment.
Essentially, I currently have two options for implementing a multithreaded searching algorithm that improves on the efficiency of a previous one.
Either search through a linked list using two threads and relay the data to two other threads for processing (the application runs on a quad core),
or use a massive preallocated array that will remain largely empty and contain only pointers to predefined structures, in which case the searching could be done much faster and there would be no issues in syncing the threads.
Would the faster search outweigh the memory requirements and reduce computing time?
It should be mentioned that the array will remain largely empty, but the overall memory allocated to it would be far larger than the linked list. On the other hand, the index of the furthest non-empty array element could also be stored, so as to prevent the program from searching the entire array.
I should also mention that the array stores pointers to live-cell coordinates, and as such is only kept so large as a worst-case measure. I am also planning on ignoring any NULL values in order to skip array elements that have been deleted.
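For concreteness, here is a minimal sketch in C of the array option as I picture it (the type, field names and capacity are placeholders): a preallocated array of pointers to live-cell structures, a stored index of the furthest non-empty slot, and NULL entries skipped during the scan.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical live-cell record; the fields are placeholders. */
    typedef struct {
        int x, y;
    } Cell;

    #define MAX_CELLS 1000000        /* worst-case capacity, mostly empty in practice */

    static Cell *cells[MAX_CELLS];   /* NULL means "deleted" or "never used" */
    static size_t highest_used = 0;  /* index just past the furthest non-empty slot */

    /* Scan only up to the high-water mark, skipping NULL (deleted) entries. */
    static void for_each_live_cell(void (*visit)(const Cell *)) {
        for (size_t i = 0; i < highest_used; ++i) {
            if (cells[i] != NULL)
                visit(cells[i]);
        }
    }

    static void add_cell(size_t slot, int x, int y) {
        Cell *c = malloc(sizeof *c);
        if (!c)
            return;
        c->x = x;
        c->y = y;
        cells[slot] = c;
        if (slot + 1 > highest_used)
            highest_used = slot + 1;
    }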

Game Of Life and searching?????
If you want a multithreaded Game of Life: calculate line n/2 on its own, but don't store it in the array, just in a buffer; run two threads that calculate and store lines 0 to n/2 - 1 and lines n/2 + 1 to n - 1 respectively; then copy line n/2 into the result.
For four threads, calculate the lines at n/4, n/2 and 3n/4 first, give each thread a quarter of the job, then copy the three buffered lines into the array.
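A rough sketch in C of the two-thread version, assuming a fixed-size board with wrap-around edges and a separate next-generation grid (those details are assumptions made for the sketch, not part of the original question):

    #include <pthread.h>
    #include <string.h>

    #define N 512                        /* assumed board size, for illustration */

    static unsigned char cur[N][N], next_gen[N][N];

    /* Count live neighbours, wrapping around the edges. */
    static int neighbours(int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; ++dr)
            for (int dc = -1; dc <= 1; ++dc)
                if (dr || dc)
                    count += cur[(r + dr + N) % N][(c + dc + N) % N];
        return count;
    }

    static unsigned char rule(int r, int c) {
        int n = neighbours(r, c);
        return n == 3 || (n == 2 && cur[r][c]);
    }

    struct span { int first, last; };    /* inclusive row range for one thread */

    static void *step_rows(void *arg) {
        const struct span *s = arg;
        for (int r = s->first; r <= s->last; ++r)
            for (int c = 0; c < N; ++c)
                next_gen[r][c] = rule(r, c);
        return NULL;
    }

    static void step_generation(void) {
        unsigned char boundary[N];
        /* Compute the middle line into a buffer first, as described above. */
        for (int c = 0; c < N; ++c)
            boundary[c] = rule(N / 2, c);

        struct span top = { 0, N / 2 - 1 }, bottom = { N / 2 + 1, N - 1 };
        pthread_t t1, t2;
        pthread_create(&t1, NULL, step_rows, &top);
        pthread_create(&t2, NULL, step_rows, &bottom);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        memcpy(next_gen[N / 2], boundary, sizeof boundary);
        memcpy(cur, next_gen, sizeof cur);  /* publish the new generation */
    }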

If your array is as sparse as most GOL boards, then the list will likely be much, much faster. Having a pointer to the next piece of data is way better than scanning for it.
That said, the overall performance may not be better, as others have mentioned.

Related

What is more expensive: a compare, or accessing an array index?

Basically, I saw a video on YouTube that visualized sorting algorithms, and they provided the program so that we can play with it. The program counts two main things (comparisons and array accesses). I wanted to see which of merge sort and quick sort is the fastest.
For 100 random numbers:
quick sort:
comparisons 1000
array accesses 1400
merge sort:
comparisons 540
array accesses 1900
So quick sort uses fewer array accesses while merge sort uses fewer comparisons, and the difference increases with the number of elements. Which of these is harder for the computer to do?
The numbers are off. Here are results from actual runs with 100 random numbers. Note that the quick sort compare count is affected by the implementation; Hoare partitioning uses fewer compares than Lomuto.
quick sort (Hoare partition scheme)
pivot reads 87 (average)
compares 401 (average)
array accesses 854 (average)
merge sort:
compares 307 (average)
array accesses 1400 (best, average, worst)
Since numbers are being sorted, I'm assuming they fit in registers, which reduces the array accesses.
For quick sort, the compares are done against a pivot value, which should be read just once per recursive instance of quick sort and placed in a register, followed by one read for each value compared. An optimizing compiler may keep the values used for the compare in registers, so that swaps already have the two values in registers and only need to do two writes.
For merge sort, the compares add almost zero overhead to the array accesses, since the compared values will be read into registers, compared, then written from the registers instead of reading from memory again.
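For reference, here is a minimal Hoare-partition quicksort in C; note that the pivot is copied into a local variable once per call, which is what allows the compiler to keep it in a register during the partitioning loop.

    /* Hoare partition: the pivot is read once into a local (likely a register),
     * so each comparison costs one array read plus a register compare. */
    static int hoare_partition(int a[], int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];   /* single pivot read per call */
        int i = lo - 1, j = hi + 1;
        for (;;) {
            do { ++i; } while (a[i] < pivot);
            do { --j; } while (a[j] > pivot);
            if (i >= j)
                return j;
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;  /* both values already loaded */
        }
    }

    static void quicksort(int a[], int lo, int hi) {
        if (lo < hi) {
            int p = hoare_partition(a, lo, hi);
            quicksort(a, lo, p);
            quicksort(a, p + 1, hi);
        }
    }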
Sorting performance depends on many conditions; I don't think answering your exact question will lead to a helpful answer (you can easily benchmark it yourself).
Sorting a small number of elements is usually not time-critical; benchmarking makes sense for larger lists.
Also, it is rare to sort an array of plain integers; it is much more common to sort a list of objects, comparing one or more of their properties.
If you are aiming for performance, think about multithreading.
MergeSort is stable (equal elements keep their relative position), QuickSort is not, so you are comparing different results.
In your example, the quicksort algorithm is probably faster most of the time. If the comparison is more complex, e.g. strings instead of ints or multiple fields, merge sort becomes more and more effective because it needs fewer (expensive) comparisons. If you want to parallelize the sorting, merge sort lends itself naturally to it because of the structure of the algorithm.

Cache Optimization - Hashmap vs QuickSort?

Suppose that I have N unsorted arrays, of integers. I'd like to find the intersection of those arrays.
There are two good ways to approach this problem.
One, I can sort the arrays in place with an n log n sort, like quicksort or merge sort. Then I can put a pointer at the start of each array, compare the arrays to each other, and advance the pointer of whichever array[pointer] is smallest; if they're all equal, I've found an intersection element.
This is an O(n log n) solution with constant memory (since everything is done in place).
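To make the first approach concrete, here is a minimal two-array version of the pointer walk in C (the N-array case generalizes by advancing whichever of the N pointers currently sees the smallest value):

    #include <stddef.h>

    /* Intersect two sorted int arrays by advancing whichever side currently
     * holds the smaller value; emits each common value once per matching pair. */
    static void intersect_sorted(const int *a, size_t na,
                                 const int *b, size_t nb,
                                 void (*emit)(int)) {
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j])
                ++i;
            else if (a[i] > b[j])
                ++j;
            else {
                emit(a[i]);
                ++i;
                ++j;
            }
        }
    }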
The second solution is to use a hash map, putting in the values that appear in the first array as keys, and then incrementing those values as you traverse through the remaining arrays (and then grabbing everything that had a value of N). This is an O(n) solution, with O(n) memory, where n is the total size of all of the arrays.
Theoretically, the former solution is O(n log n) and the latter is O(n). However, hash maps do not have great locality, because collisions can scatter items randomly through the map. The sorting solution, although O(n log n), traverses each array sequentially, exhibiting excellent locality. Since a CPU will tend to pull the array values next to the current index into the cache, the O(n log n) solution will hit the cache much more often than the hash map solution.
Therefore, given a significantly large array size (as the number of elements goes to infinity), is it feasible that the O(n log n) solution is actually faster than the O(n) solution?
For integers you can use a non-comparison sort (see counting, radix sort). A large set might be encoded, e.g. sequential runs into ranges. That would compress the data set and allow for skipping past large blocks (see RoaringBitmaps). There is the potential to be hardware friendly and have O(n) complexity.
Complexity theory does not account for constants. As you suspect, there is always the potential for an algorithm with a higher complexity to be faster than the alternative, due to the hidden constants. By exploiting the nature of the problem, e.g. limiting the solution to integers, there are potential optimizations not available to a general-purpose approach. Good algorithm design often requires understanding and leveraging those constraints.
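As a sketch of that idea in C, assuming (purely for illustration) that the values fall below some known bound: the intersection can then be computed with plain bitmaps and no element-to-element comparisons at all, and run-length or Roaring-style encodings are a compressed refinement of the same approach.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define VALUE_LIMIT (1u << 20)          /* assumed upper bound on the values */
    #define WORDS (VALUE_LIMIT / 64)

    /* Build a bitmap with one bit per possible value. */
    static void to_bitmap(const uint32_t *a, size_t n, uint64_t *bits) {
        memset(bits, 0, WORDS * sizeof *bits);
        for (size_t i = 0; i < n; ++i)
            bits[a[i] / 64] |= 1ull << (a[i] % 64);
    }

    /* Start with the first array's bitmap in acc, then AND in a bitmap built
     * from each of the remaining arrays; the surviving bits are the
     * intersection.  O(n + range), sequential memory access. */
    static void intersect_into(uint64_t *acc, const uint64_t *bits) {
        for (size_t w = 0; w < WORDS; ++w)
            acc[w] &= bits[w];
    }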

Transferring a large, variable amount of memory from CUDA

CUDA is awesome and I'm using it like crazy, but I'm not using its full potential because I'm having an issue transferring memory, and I was wondering if there was a better way to get a variable amount of memory out. Basically I send a 65,535-item array into CUDA, and CUDA analyzes each data item in around 20,000 different ways; if there's a match in my program's logic, then it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total: if the total is equal to a number I'm looking for, then it saves the results (which is a 30-int list for each analyzed item).
The problem is that 65,535 (blocks/items in the data array) * 20,000 (total combinations tested per item) = 1,310,700,000. This means I would need to create an array of that size to handle the chance that all the data is a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy memory-wise). I've been forced to make it smaller and send fewer blocks to process, because I don't know whether CUDA can write efficiently to a linked list or a dynamically sized list (with this approach it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block ID? When I test this process on the CPU, less than 10% of the items in the array have a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.
P.S. Looking at the above, it's exactly what I'm doing, but if it's confusing, then another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is that I am sending 20,000 arrays (each containing 65,535 items), adding each item to its peer in the other arrays, and if the total equals a number (say 200-210), then I want to know the numbers that were added to get that matching result.
If the numbers range very widely then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck mallocing for the maximum possible results (and buying a lot more memory)?
As per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
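To show the shape of the atomic-index idea described above, here is the same pattern expressed with plain C11 atomics (in a kernel the reservation would instead be an atomicAdd on a device counter; the sizes and names below are placeholders):

    #include <stdatomic.h>

    #define MAX_RESULTS (1 << 16)     /* assumed cap on stored solutions */
    #define RESULT_INTS 30            /* one 30-int record per match, as described */

    static int results[MAX_RESULTS][RESULT_INTS];
    static atomic_uint next_slot;     /* shared append index */

    /* Reserve a slot by fetch-and-add (which returns the old index),
     * then write the record at that index. */
    static void store_result(const int record[RESULT_INTS]) {
        unsigned slot = atomic_fetch_add(&next_slot, 1u);
        if (slot < MAX_RESULTS) {     /* drop results beyond the assumed cap */
            for (int k = 0; k < RESULT_INTS; ++k)
                results[slot][k] = record[k];
        }
    }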

A Memory-Adaptive Merge Algorithm?

Many algorithms work by using the merge algorithm to merge two different sorted arrays into a single sorted array. For example, given as input the arrays
1 3 4 5 8
and
2 6 7 9
The merge of these arrays would be the array
1 2 3 4 5 6 7 8 9
Traditionally, there seem to be two different approaches to merging sorted arrays (note that the case for merging linked lists is quite different). First, there are out-of-place merge algorithms that work by allocating a temporary buffer for storage, then storing the result of the merge in the temporary buffer. Second, if the two arrays happen to be part of the same input array, there are in-place merge algorithms that use only O(1) auxiliary storage space and rearrange the two contiguous sequences into one sorted sequence. These two classes of algorithms both run in O(n) time, but the out-of-place merge algorithm tends to have a much lower constant factor because it does not have such stringent memory requirements.
My question is whether there is a known merging algorithm that can "interpolate" between these two approaches. That is, the algorithm would use somewhere between O(1) and O(n) memory, but the more memory it has available to it, the faster it runs. For example, if we were to measure the absolute number of array reads/writes performed by the algorithm, it might have a runtime of the form n g(s) + f(s), where s is the amount of space available to it and g(s) and f(s) are functions derivable from that amount of space available. The advantage of this function is that it could try to merge together two arrays in the most efficient way possible given memory constraints - the more memory available on the system, the more memory it would use and (ideally) the better the performance it would have.
More formally, the algorithm should work as follows. Given as input an array A consisting of two adjacent, sorted ranges, rearrange the elements in the array so that the elements are completely in sorted order. The algorithm is allowed to use external space, and its performance should be worst-case O(n) in all cases, but should run progressively more quickly given a greater amount of auxiliary space to use.
Is anyone familiar with an algorithm of this sort (or do you know where to look to find a description of one)?
At least according to the documentation, the in-place merge function in the SGI STL is adaptive and "its run-time complexity depends on how much memory is available". The source code is available, of course, so you could at least check this one.
EDIT: STL has inplace_merge, which will adapt to the size of the temporary buffer available. If the temporary buffer is at least as big as one of the sub-arrays, it's O(N). Otherwise, it splits the merge into two sub-merges and recurses. The split takes O(log N) to find the right part of the other sub array to rotate in (binary search).
So it goes from O(N) to O(N log N) depending on how much memory you have available.
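A sketch in C of the easy case: when the auxiliary buffer can hold the entire left run, a single linear pass suffices (this is the O(N) branch; the rotation-based recursion is only needed when the buffer is smaller than both runs).

    #include <stddef.h>
    #include <string.h>

    /* Merge a[0..mid-1] and a[mid..n-1] (both sorted) back into a[0..n-1],
     * using a caller-provided buffer large enough to hold the left run. */
    static void merge_with_buffer(int a[], size_t mid, size_t n, int *buf) {
        memcpy(buf, a, mid * sizeof *a);       /* copy the left run out */
        size_t i = 0, j = mid, k = 0;
        while (i < mid && j < n)
            a[k++] = (buf[i] <= a[j]) ? buf[i++] : a[j++];
        while (i < mid)
            a[k++] = buf[i++];                 /* right run's tail is already in place */
    }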

Realloc Vs Linked List Scanning

I have to read an unknown number of rows from a file and save them into a structure (I would like to avoid a preprocessing pass just to count the total number of elements).
After the reading phase I have to make some computations on each of the elements of these rows.
I figured out two ways:
Use realloc each time I read a row. This way the allocation phase is slow but the computation phase is easier thanks to the index access.
Use a linked list each time I read a row. This way the allocation phase is faster but the computation phase is slower.
What is better from a complexity point of view?
How often will you traverse the linked list? If it's only once, go for the linked list. Another few things: will there be a lot of small allocations? You could make a few smaller buffers for, let's say, 10 lines each and link those together. But that's all a question of profiling.
I'd do the simplest thing first and see if that fits my needs; only then would I think about optimizing.
Sometimes one wastes too much time thinking about the optimum even when the second-best solution also fits the needs perfectly.
Without more details on how you are going to use the information, it is a bit tough to comment on the complexity. However, here are a few thoughts:
If you use realloc, it would likely be better to realloc to add "some" more items (rather than one each and every time). Typically, a good strategy is to double the size each time; see the sketch after these points.
If you use a linked list, you could speed up the access in a simple post-processing step. Allocate an array of pointers to the items and traverse the list once setting the array elements to each item in the list.
If the items are of a fixed size in the file, you could pre-compute the count simply by seeking to the end of the file, determining the file size and dividing it by the item size. Even if the items are not of a fixed size, you could possibly use this as an estimate to get "close" to the necessary size and reduce the number of reallocs required.
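Here is a minimal sketch in C of the doubling strategy from the first point (the struct and function names are mine):

    #include <stdlib.h>
    #include <string.h>

    /* Growable row buffer: double the capacity whenever it runs out, so the
     * amortized cost per appended row is O(1) despite occasional reallocs. */
    struct rows {
        char **data;
        size_t count, capacity;
    };

    static int rows_append(struct rows *r, const char *line) {
        if (r->count == r->capacity) {
            size_t new_cap = r->capacity ? r->capacity * 2 : 16;
            char **p = realloc(r->data, new_cap * sizeof *p);
            if (!p)
                return -1;                 /* the old block stays valid on failure */
            r->data = p;
            r->capacity = new_cap;
        }
        size_t len = strlen(line) + 1;     /* keep our own copy of the row */
        char *copy = malloc(len);
        if (!copy)
            return -1;
        memcpy(copy, line, len);
        r->data[r->count++] = copy;
        return 0;
    }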
As other users have already stated: "Premature optimization is the root of all evil" (Donald Knuth).
I have a different proposal using realloc: in the C++ STL, the std::vector container grows every time an object is inserted and not enough space is available. How much it grows depends on the current pre-allocated size, but this is implementation-specific. For example, you could keep track of the actual number of preallocated objects and, if the space runs out, call realloc with double the amount currently allocated. I hope this was somewhat understandable!
The caveat is, of course, that you will probably allocate more space than you actually consume and need.
