I have to read an unknown number of rows from a file and save them into a structure (I would like to avoid a preprocessing pass to count the total number of elements).
After the reading phase I have to make some computations on each of the elements of these rows.
I figured out two ways:
Use realloc each time I read a row. This way the allocation phase is slow but the computation phase is easier thanks to the index access.
Use a linked list each time I read a row. This way the allocation phase is faster but the computation phase is slower.
What is better from a complexity point of view?
How often will you traverse the linked list? If it's only once, go for the linked list. A few other things: will there be a lot of small allocations? You could make a few smaller buffers for, let's say, 10 lines each and link those together. But that's all a question of profiling.
I'd do the simplest thing first and see if that fits my needs; only then would I think about optimizing.
Sometimes one wastes too much time thinking about the optimum even when the second best solution also fits the needs perfectly.
Without more details on how you are going to use the information, it is a bit tough to comment on the complexity. However, here are a few thoughts:
If you use realloc, it would likely be better to realloc to add "some" more items (rather than one each and every time). Typically, a good algorithm is to double the size each time.
If you use a linked list, you could speed up the access in a simple post-processing step. Allocate an array of pointers to the items and traverse the list once setting the array elements to each item in the list.
If the items are of a fixed size in the file, you could pre-compute the size simply by seeking to the end of the file, determining the size, divide by the item size and you have the result. Even if it is not a fixed size, you could possibly use this as an estimate to get "close" to the necessary size and reduce the number of reallocs required.
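The seek-to-end idea from the last point could look roughly like the sketch below. The function name count_records is made up for illustration, and the sketch assumes a binary stream on a platform where seeking to the end is meaningful (true on POSIX systems and Windows, though the C standard does not strictly guarantee it) and a file small enough for ftell's long return type.

```c
#include <stdio.h>

/* Hypothetical helper: returns how many fixed-size records an open
 * binary file holds, or -1 on error. The original file position is
 * restored before returning. */
long count_records(FILE *fp, size_t record_size)
{
    if (fp == NULL || record_size == 0)
        return -1;

    long start = ftell(fp);               /* remember where we were */
    if (start < 0)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0)     /* jump to the end */
        return -1;

    long end = ftell(fp);                 /* file size in bytes */
    if (end < 0 || fseek(fp, start, SEEK_SET) != 0)
        return -1;

    return end / (long)record_size;       /* truncates a partial record */
}
```

For variable-size rows, the same byte count divided by a guessed average row length would only give a starting capacity, so the array would still need to be able to grow.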
As other users have already stated:
"Premature optimization is the root of all evil." (Donald Knuth)
I have a different proposal using realloc: in the C++ STL, the std::vector container grows whenever an object is inserted and not enough space is available. How much it grows depends on the currently allocated capacity and is implementation-specific. You could do the same: keep track of the number of preallocated objects, and when the space runs out, call realloc with double the currently allocated space. I hope this was somewhat understandable!
The caveat, of course, is that you will probably allocate more space than you actually need.
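A minimal sketch of that doubling scheme in C, assuming the rows can be reduced to whitespace-separated integers. The LongVec type, the function names, and the initial capacity of 16 are all illustrative choices, not anything from the question.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable array of long values. */
typedef struct {
    long  *data;
    size_t size;      /* elements in use */
    size_t capacity;  /* elements allocated */
} LongVec;

/* Appends one value, doubling the capacity when the array is full.
 * Returns 0 on success, -1 if realloc fails (v->data stays valid). */
int longvec_push(LongVec *v, long value)
{
    if (v->size == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 16;
        long *p = realloc(v->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        v->data = p;
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
    return 0;
}

/* Reads integers from fp until EOF or the first token that does not
 * parse. Returns 0 on success, -1 on allocation failure. */
int longvec_read(LongVec *v, FILE *fp)
{
    long x;
    while (fscanf(fp, "%ld", &x) == 1) {
        if (longvec_push(v, x) != 0)
            return -1;
    }
    return 0;
}
```

Starting from a zero-initialized LongVec works because realloc on a null pointer behaves like malloc. If the leftover capacity matters once reading is done, a final realloc down to size elements would trim it.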
I know how to implement both; however, I am wondering what the uses are for a stack based on a fixed-length array. Is there a situation where you never want your stack to grow larger than X?
If you have the choice, typically you would be doing this to make it faster.
More information could be given depending on your programming language, but typically the advantages would be speed. By using a fixed size, the size of the reserved block of memory no longer needs to be jumbled around with any changes. Your memory can get fragmented (so a little bit of your data is over here, a little bit over there, and every time you change the array this can get worse.) so using a fixed size will keep it all in one place.
Many languages have garbage collection routines, for example Javascript, where if you were creating something smooth and realtime like a video game, you'd want to use a fixed size because then it won't stall every few seconds when the garbage collection kicks in. In the case of Javascript, you get the advantage of statically typing the array as well, which means the virtual machine no longer has to infer what type the variables are and check if conversions are necessary.
If you were programming in C or C++ for example, the typical procedure for reading from files (i.e, loading an image) would be to check the size of the file you're about to load into memory, then dynamically allocate exactly the size you need in memory.
Recommended reading: Cost of array operations in Javascript. This will teach you about memory fragmentation and garbage collection. Insights gained here apply to many other languages. (for example Node, PHP, ActionScript, Ruby, etc.)
One use for this could be a rolling queue or ring buffer, which means that up to the last N points of data are recorded. This is useful for recording a moving average, such as the average latency over the last N seconds. Re-using the same array elements and just moving the start and end around can reduce garbage collection as opposed to removing end elements.
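As a rough illustration of the ring-buffer idea, here is a fixed-size moving average in C. The Ring type, the function names, and the capacity of 4 are assumptions made for this sketch.

```c
#include <stddef.h>

#define RING_CAPACITY 4   /* hypothetical N: how many samples to keep */

typedef struct {
    double samples[RING_CAPACITY];
    size_t next;    /* slot the next sample will overwrite */
    size_t count;   /* valid samples, at most RING_CAPACITY */
    double sum;     /* running sum of the valid samples */
} Ring;

/* Records a sample, overwriting the oldest one once the buffer is full. */
void ring_add(Ring *r, double x)
{
    if (r->count == RING_CAPACITY)
        r->sum -= r->samples[r->next];   /* drop the oldest sample */
    else
        r->count++;
    r->samples[r->next] = x;
    r->sum += x;
    r->next = (r->next + 1) % RING_CAPACITY;
}

/* Average of the samples currently held, or 0.0 if there are none. */
double ring_average(const Ring *r)
{
    return r->count ? r->sum / (double)r->count : 0.0;
}
```

No memory is allocated or released after the buffer is created, which is the property the answer is pointing at. A running sum of floating-point values can drift over a very long run, so recomputing the sum from the array occasionally may be worthwhile.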
I've noticed that it is very common (especially in interview questions and homework assignments) to implement a dynamic array; typically, I see the question phrased as something like:
Implement an array which doubles in capacity when full
Or something very similar. They almost always (in my experience) use the word double explicitly, rather than a more general
Implement an array which increases in capacity when full
My question is, why double? I understand why it would be a bad idea to use a constant value (thanks to this question) but it seems like it makes more sense to use a larger multiple than double; why not triple the capacity, or quadruple it, or square it?
To be clear, I'm not asking how to double the capacity of an array, I'm asking why doubling is the convention.
Yes, it is common practice.
Doubling is a good way to manage memory. Heap management algorithms are often based on the classic Buddy System; it's an easy way to deal with addressing, coalescing, and other challenges. Knowing this, it is good to stick with multiples of 2 when dealing with allocation (though there are hybrid algorithms, like the slab allocator, that help with fragmentation, so using a multiple of 2 isn't as important as it once was).
Knuth covers it in one of his books that I have but forgot the title.
See http://en.wikipedia.org/wiki/Buddy_memory_allocation
Another reason to double an array size is about the addition cost. You don't want each Add() operation to trigger a reallocation call. If you've filled N slots, there is a good chance you'll need some multiple of N anyway, history is a good indicator of future needs, so the object needs to "graduate" to the next arena size. By doubling, the frequency of reallocation falls off logarithmically (Log N). Doubling is just the most convenient multiple (being the smallest whole multiplier it is more memory efficient than 3*N or 4*N, plus it tends to follow heap memory management models closely).
The reason behind doubling is that it turns repeatedly appending an element into an amortized O(1) operation. Put another way, appending n elements takes O(n) time.
More accurately, increasing by any multiplicative factor achieves that, but doubling is a common choice. I've seen other choices, such as increasing by a factor of 1.5.
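The amortized claim can be checked by simply counting how many element copies the reallocations cause. The two functions below are an illustrative simulation (their names and the starting capacity of 1 are assumptions); they move no real data, they only tally what a growing array would have to copy.

```c
#include <stddef.h>

/* Element copies caused by n appends when the capacity starts at 1
 * and is multiplied by `factor` (an integer >= 2) whenever it is full. */
size_t copies_geometric(size_t n, size_t factor)
{
    size_t cap = 1, size = 0, copies = 0;
    for (size_t i = 0; i < n; i++) {
        if (size == cap) {
            copies += size;   /* every element moves to the new block */
            cap *= factor;
        }
        size++;
    }
    return copies;
}

/* Same count when the capacity grows by a constant `step` (>= 1). */
size_t copies_additive(size_t n, size_t step)
{
    size_t cap = 1, size = 0, copies = 0;
    for (size_t i = 0; i < n; i++) {
        if (size == cap) {
            copies += size;
            cap += step;
        }
        size++;
    }
    return copies;
}
```

For 1024 appends, doubling copies 1 + 2 + 4 + ... + 512 = 1023 elements in total, which is less than one copy per append. Growing by one slot at a time copies 1 + 2 + ... + 1023 = 523776 elements, which is quadratic in n.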
CUDA is awesome and I'm using it like crazy, but I am not using its full potential because I'm having an issue transferring memory, and I was wondering if there is a better way to get a variable amount of memory out. Basically, I send a 65535-item array into CUDA, and CUDA analyzes each data item around 20,000 different ways; if there's a match in my program's logic, it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total; if the total is equal to a number I'm looking for, it saves the results (a 30-int list for each analyzed item).
The problem is 65535 (blocks/items in data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to deal with the chance that all the data will be a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory). I've been forced to make it smaller and send fewer blocks to process because I don't know if CUDA can write efficiently to a linked list or a dynamically sized list (with this approach, it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory at a location that is not derived from the blockid? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use so much memory each time I send work to the kernel.
p.s. I'm looking above and although it's exactly what I'm doing, if it's confusing then another way of thinking about it (not exactly what I'm doing but good enough to understand the problem) is I am sending 20,000 arrays (that each contain 65,535 items) and adding each item with its peer in the other arrays and if the total equals a number (say 200-210) then I want to know the numbers it added to get that matching result.
If the numbers range very widely, then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the max possible results (and buying a lot more memory)?
As Per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
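The atomic-index pattern described above can be sketched on the host side in plain C11. This is a CPU analogue only, not kernel code: atomic_fetch_add stands in for what atomicAdd() on a device counter would do in a kernel, and the ResultBuffer type, store_result function, and MAX_RESULTS capacity are all made up for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RESULT_INTS 30     /* from the question: each match is a 30-int list */
#define MAX_RESULTS 1024   /* hypothetical capacity of the compact output */

typedef struct {
    atomic_size_t next;                    /* next free row */
    int rows[MAX_RESULTS][RESULT_INTS];
} ResultBuffer;

/* Claims the next free row and copies one result into it.
 * Returns the row index used, or -1 if the buffer is already full. */
long store_result(ResultBuffer *buf, const int *result)
{
    /* The fetch-add returns the OLD value, so concurrent callers each
     * receive a distinct row index. */
    size_t slot = atomic_fetch_add(&buf->next, 1);
    if (slot >= MAX_RESULTS)
        return -1;
    memcpy(buf->rows[slot], result, sizeof buf->rows[slot]);
    return (long)slot;
}
```

The counter keeps increasing even after the buffer fills, so the number of stored rows is the smaller of the counter and MAX_RESULTS. A counter that ends above the capacity signals that the output buffer was too small and the work should be rerun with a larger one or in smaller batches.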
I'm using an array implementation of a stack. If the stack is full, instead of throwing an error I double the array size, copy over the elements, change the stack reference, and add the new element to the stack. (I'm following a book to teach myself this stuff.)
What I don't fully understand is why I should double it. Why not increase it by a fixed amount? Why not just increase it by 3 times?
I assume it has something to do with the time complexity or something?
An explanation would be greatly appreciated!
Doubling has just become the standard for generic implementations of things like array lists ("dynamically" sized arrays that really just do what you're doing in the background) and really most dynamically sized data types that are backed by arrays. If you knew your scenario and had the time and willpower to write a custom stack/array list implementation you could certainly write a more optimal solution.
If you knew in your software that items would be added incredibly infrequently after the initial array was built, you could initialise it with a specific size then only increase it by the size of what was being added to preserve memory.
On the other hand, if you knew the list would be expanded very frequently, you might choose to increase the list size by 3 times or more when it runs out of space.
For a generic implementation that's part of a common library, your implementation specifics and requirements aren't known so doubling is just a happy medium.
In theory, you indeed arrive at different time complexities. If you increase by a constant size, you divide the number of re-allocations (and thus O(n) copies) by a constant, but you still get O(n) amortized time complexity per append. If you double the size, you get a better time complexity for appending (amortized O(1), IIRC), and as you consume at most twice as much memory as needed, you still have the same space complexity.
In practice, it's less severe, but nevertheless viable. Copies are expensive, while a bit of memory usually doesn't hurt. It's a tradeoff, but you'd have to be quite low on memory to choose another strategy. Often, you don't know beforehand (or can't let the stack know due to API limits) how much space you'll actually need. For instance, if you build a 1024 element stack starting with one element, you get down to (I may be off by one) 10 re-allocations, from 1024/K -- assuming K=3, that would be roughly 34 times as many re-allocations, only to save a bit of memory.
The same holds for any other factor. 2 is nice because you never end up with non-integer sizes and it's still quite small, limiting the wasted space to 50%. Specific use cases may be better-served by other factors, but usually the ROI is too small to justify re-implementing and optimizing what's already available in some library.
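The reallocation counts in the 1024-element example can be reproduced with a small simulation. The function names are illustrative, and both functions assume the stack starts with a capacity of one element, as in the example.

```c
#include <stddef.h>

/* Reallocations needed to grow from capacity 1 to at least `target`
 * when the capacity doubles each time. */
size_t reallocs_doubling(size_t target)
{
    size_t cap = 1, n = 0;
    while (cap < target) {
        cap *= 2;
        n++;
    }
    return n;
}

/* Same, when the capacity grows by a fixed k each time (k must be nonzero). */
size_t reallocs_fixed(size_t target, size_t k)
{
    size_t cap = 1, n = 0;
    while (cap < target) {
        cap += k;
        n++;
    }
    return n;
}
```

With a target of 1024, doubling needs 10 reallocations, while growing by K=3 needs 341, which matches the "roughly 34 times as many" estimate.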
The problem with a fixed amount is choosing that fixed amount - if you (say) choose 100 items as your fixed amount, that makes sense if your stack is currently ~100 items in size. However, if your stack is already 10,000 items in size, it's likely to grow to 11,000 items. You don't want to do 10 reallocations / moves to grow the size of your stack by 10%.
As for 2x versus 3x, that's pretty arbitrary - nothing wrong with choosing 3x; which is "better" will depend on your exact use case and how you define "better".
Scaling by 2x is easy, and will ensure that on average items get copied no more than twice [an expansion will copy half the items for the first time, a quarter for the second, an eighth for the third, etc.] If things instead grew by a fixed amount, then when e.g. the twentieth expansion was performed, half the items will be copied for the tenth time.
Growing by a factor of more than 2x will increase the average "permanent" slack space; growing by a smaller factor will increase the amount of storage that is allocated and abandoned. Depending upon the relative perceived "costs" of permanent and abandoned allocations, the optimal growth factor may be larger or smaller, but growth factors which are anywhere close to optimum will generally not perform too much worse than would optimum growth factors. Regardless of what the optimum growth factor would be, a growth factor of 2x will be close enough to yield decent performance.
I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
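A minimal sketch of the "just call qsort" approach. The helper names are illustrative. The comparator compares instead of subtracting, because the difference of two long long values can overflow.

```c
#include <stdlib.h>

/* qsort comparator for long long; returns -1, 0, or 1 without
 * computing x - y, which could overflow. */
static int cmp_long_long(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Sorts n values in place in ascending order. */
void sort_long_longs(long long *values, size_t n)
{
    qsort(values, n, sizeof *values, cmp_long_long);
}
```

For four million values, the buffer would come from something like malloc(n * sizeof(long long)), with the result checked for NULL before reading the data in.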
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to avoid having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or network, or whatever the source is) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort through the lot, or have to worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or call vector::reserve up front.
You'd then best be advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This essentially is the same paradigm as the self-ordering set but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case