For what do we actually require manual storage allocation?
The only use case I can think of would be larger binary data that does not fit into a 32-bit integer.
Is this correct?
What are other use cases?
In general, you need to do manual storage allocation whenever the size of your data is not known at compile time. Almost all situations fall into one of two categories:
Your program must read data from files/networks/user inputs etc., and the exact amount of that data is not known at compile time, or
Your program must produce and store some output, and the exact amount of that output is not known to you at the time when you write your program.
Many very common data structures assume the ability to allocate memory of arbitrary size, with the precise size determined at runtime. Doing so lets the data structure "grow" and "shrink" dynamically as the storage demands of your program change over time and with the amount of data it processes.
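A minimal sketch of the first category, assuming the element count comes from user input at runtime:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n;
        /* The number of elements is only known at runtime. */
        if (scanf("%zu", &n) != 1)
            return 1;

        /* Allocate exactly as many ints as were requested. */
        int *values = malloc(n * sizeof *values);
        if (values == NULL)
            return 1;

        for (size_t i = 0; i < n; i++)
            values[i] = (int)i;

        free(values);
        return 0;
    }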
There are tons of examples. Allocating memory to fill a struct, for example, a linked-list node.
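For instance, a small sketch (the node layout here is just illustrative) of allocating linked-list nodes on the heap:

    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    /* Allocate one node on the heap and push it onto the list head. */
    struct node *push(struct node *head, int value)
    {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return head;          /* allocation failed; list unchanged */
        n->value = value;
        n->next = head;
        return n;
    }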
I have a bunch of files that I'm going to be processing in batches of about 1000. Some calculations are done on the files and approximately 75% of them will need to have data stored in a struct array.
I have no way of knowing how many files will need to be stored in the array until the calculations are done at runtime.
Right now I'm counting the number of files processed that need a struct array space, and using malloc(). Then for the next batch of files I use realloc(), and so on until all files are done. This way I allocate the exact amount of memory I need.
I'm able to count the total number of files in advance. Would I be better off using one big malloc() right at the start, even though it's only going to be 75% filled?
I would try it both ways, and see if there is a noticeable difference in performance.
If there isn't, I would stick with realloc(), since then you won't allocate any excess memory that you don't need. You can also do something like the vector class in C++, which grows the capacity geometrically so that the number of reallocations is only logarithmic in the final size.
Libraries can implement different strategies for growth to balance between memory usage and reallocations, but in any case, reallocations should only happen at logarithmically growing intervals of size, so that the insertion of individual elements at the end of the vector can be provided with amortized constant time complexity.
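A rough sketch of that strategy with realloc(), assuming a simple record struct and the common doubling policy (neither is mandated by the answer above):

    #include <stdlib.h>

    struct record { /* ... per-file data ... */ int id; };

    static struct record *buf = NULL;
    static size_t used = 0, capacity = 0;

    /* Append one record, doubling the capacity only when it runs out. */
    int append(struct record r)
    {
        if (used == capacity) {
            size_t new_cap = capacity ? capacity * 2 : 16;
            struct record *tmp = realloc(buf, new_cap * sizeof *buf);
            if (tmp == NULL)
                return -1;        /* old buffer is still valid */
            buf = tmp;
            capacity = new_cap;
        }
        buf[used++] = r;
        return 0;
    }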
Keep in mind that you can also allocate memory in advance, and free some of it when you are done. It really depends on what your constraints are.
I would like to know how much memory a given data structure is consuming. Suppose I have a concurrent linked list; I would like to know how big it is. I have a few options: malloc hooks, which I don't think are thread-safe, and getrusage's ru_maxrss, but I don't really know what that gives me (how much memory the whole process consumed during its execution?). I would like to know if someone has actually measured memory consumption in this way. Is there a tool to do this? How does massif fare?
To get an idea of how many bytes it actually costs to malloc some structure, like a linked-list node, make an isolated test case (non-concurrent!) which allocates thousands of them, and look at the delta in the program's memory usage. There are various ways to do that. If your library has a mallinfo structure, like the GNU C Library found on GNU/Linux systems, you can look at the statistics before and after. Another way is to trace the program's system calls to watch its pattern of allocating from the OS. If, say, we allocate 10,000,000 list nodes, and the program performs an sbrk() call about 39,000 times, increasing the size of the process by 8192 bytes in each call, then that means a list node takes up 32 bytes, overhead and all.
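A glibc-specific sketch of the mallinfo() approach; the node layout is just an example, and uordblks only approximates the heap bytes in use:

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *next; int value; };

    int main(void)
    {
        enum { N = 1000000 };
        struct mallinfo before = mallinfo();

        struct node *head = NULL;
        for (int i = 0; i < N; i++) {
            struct node *n = malloc(sizeof *n);
            n->next = head;
            n->value = i;
            head = n;
        }

        struct mallinfo after = mallinfo();
        /* Average heap cost per node, overhead included. */
        printf("%d bytes per node\n", (after.uordblks - before.uordblks) / N);
        return 0;
    }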
Keep in mind that allocating thousands of objects of the same size in a single thread does not accurately represent the memory usage of a real program, which also includes fragmentation.
If you want to allocate small structures and come close to not wasting a byte (or not causing any waste that you don't know about and control), and to control fragmentation, then allocate large arrays of the objects from malloc (or your system allocator of choice) and break them up yourself. There is still unknown overhead in the malloc but it is divided over a large number of objects, making it negligible.
Or, generally, write your own allocator whose behavior and overheads you understand in detail, and which itself takes big chunks from the system.
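A hedged sketch of that idea, carving fixed-size objects out of one large malloc'd block with simple bump allocation (no per-object free here):

    #include <stdlib.h>

    struct pool {
        char  *block;      /* one big allocation from malloc */
        size_t obj_size;   /* size of each object handed out */
        size_t count;      /* how many objects the block holds */
        size_t next;       /* index of the next unused object */
    };

    int pool_init(struct pool *p, size_t obj_size, size_t count)
    {
        p->block = malloc(obj_size * count);
        p->obj_size = obj_size;
        p->count = count;
        p->next = 0;
        return p->block ? 0 : -1;
    }

    /* Hand out the next slot; malloc's per-call overhead is paid only once. */
    void *pool_alloc(struct pool *p)
    {
        if (p->next == p->count)
            return NULL;                  /* pool exhausted */
        return p->block + p->next++ * p->obj_size;
    }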
Conceptually speaking you need to know the number of items you are working with. Then you need to know the size of each different data type used in your data structure. You also will have to take into account the size of pointers or anything that is somewhat using some sort of memory.
Then you can come up with a formula that looks like the following:
Consumption = N * (sum of the sizes of the data types in one item)
In other words, add up the sizes of the data types that make up one item (including any pointers) and multiply that sum by the number of items.
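For instance, with a hypothetical linked-list item (the pointer counts toward the per-item size):

    #include <stdio.h>

    struct item { int value; struct item *next; };

    int main(void)
    {
        size_t n_items = 1000;   /* hypothetical item count */
        /* Per-item size times item count; allocator overhead and padding
           come on top of this estimate. */
        size_t consumption = n_items * sizeof(struct item);
        printf("%zu bytes\n", consumption);
        return 0;
    }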
I have to sort a large amount of data that cannot fit in memory. One way I know to do this is an "external sort", but I am wondering: is it possible to mmap this large data file and use qsort on it as if it were a normal in-memory array? If that's feasible, what are the differences from an external sort?
If the file will fit in a contiguous mapping in your address space, you can do this. If it won't, you can't.
As to the differences:
if the file just about fits, and then you add some more data, the mmap will fail. A normal external sort won't suddenly stop working because you have a little more data.
if you don't map it with MAP_PRIVATE, sorting will mutate the original file. A normal external sort won't (necessarily).
if you do map it with MAP_PRIVATE, you could crash at any time if the VM doesn't have room to duplicate the whole file. Again, a strictly external sort's memory requirements don't scale linearly with the data size.
tl;dr
It is possible, it may fail unpredictably and unrecoverably, you almost certainly shouldn't do it.
It should definitely work if the data fits in the address space (it almost certainly does on 64-bit machines; it might or might not on 32-bit ones), but performance will depend a lot on the underlying algorithm used by qsort and its data-locality properties. One issue to consider is whether it's the number of elements that's huge, or whether each element is large on disk. In the latter case, you'd be better off doing the mmap but allocating a separate array of pointers to the elements, then sorting the pointer array with a comparison function that compares what they point to. This will drastically reduce the number of times data gets moved around in memory, though it will take a little extra work at the end if you want to store the output back to the same file.
Yes, this is possible as long as you have fixed-length records in the file and the file fits within a range of contiguous VM addresses, and in fact this can be considered a naive approach to external sorting. It may not be the fastest algorithm in town, though, since qsort implementations will not be tuned for this use case.
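A minimal sketch of that naive approach, assuming fixed-length records packed back to back in the file; struct rec and its key field are placeholders, and error handling is trimmed:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct rec { long key; char payload[56]; };   /* fixed-length record */

    static int cmp_rec(const void *a, const void *b)
    {
        const struct rec *ra = a, *rb = b;
        return (ra->key > rb->key) - (ra->key < rb->key);
    }

    int sort_file(const char *path)
    {
        int fd = open(path, O_RDWR);
        struct stat st;
        fstat(fd, &st);

        /* MAP_SHARED: the sort mutates the file itself. */
        struct rec *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (recs == MAP_FAILED)
            return -1;

        qsort(recs, st.st_size / sizeof *recs, sizeof *recs, cmp_rec);

        munmap(recs, st.st_size);
        close(fd);
        return 0;
    }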
I need to have two buffers (A and B), and when either buffer is full it needs to write its contents to the "merged" buffer, C. Using memcpy seems to be too slow for this operation, as noted below in my question. Any insight?
I haven't tried it, but I've been told that memcpy will not work. This is an embedded system: 2 buffers, both of different sizes, and when they are full they dump to a common buffer C, which is bigger than the other two. Not sure why I got downvoted.
Edit: Buffers A and B will be written to before C has been completely emptied.
The memcpy is taking too long and the common buffer C is getting overrun.
memcpy is pretty much the fastest way to copy memory. It's frequently a compiler intrinsic and is highly optimized. If it's too slow you're probably going to have to find another way to speed your program up.
I'd expect that copying memory faster is not the lowest-hanging fruit in your program.
Some other opportunities could be to copy less memory or to copy less often. See if you can profile your program to analyze its performance and find where the biggest opportunities are.
Edit: With your edit it sounds like the problem is that there's not enough time for you to deal with some data all at once between the time you notice that it needs to be handled and the time that more data comes in. A solution in this case could be, as one of the commenters noted, to have additional buffers that you can flip between. So you may then have time to handle the data in one while another is filled up.
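A hedged sketch of that double-buffering idea; the names and sizes are illustrative, and in a real embedded system the buffer would typically be filled by an ISR or DMA:

    #include <stddef.h>

    #define BUF_SIZE 256

    static unsigned char buffers[2][BUF_SIZE];
    static volatile int filling = 0;     /* buffer currently being written */

    /* Called when the current buffer is full: swap, then process the
       full one while the other keeps receiving incoming data. */
    void on_buffer_full(void (*process)(const unsigned char *, size_t))
    {
        int full = filling;
        filling = 1 - filling;           /* producer now fills the other one */
        process(buffers[full], BUF_SIZE);
    }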
The only way you can merge two buffers without memcpy is by linking them, like a linked list of buffer fragments (or an array of fragments).
Consider that a buffer may not always have to be contiguous. I've done a lot of work with 600dpi images, which means very large buffers. If you can break them up into a sequence of smaller fragments, that helps reduce fragmentation as well as unnecessary copying due to buffer growth.
In some cases buffers must be contiguous, if your API or microcontroller mandates it. For example, the Windows bitmap functions require contiguous memory. You could try the C realloc function, but internally it may work like the combination of malloc + memcpy + free. Either way, as others have said, memcpy is supposed to be the fastest possible way of copying contiguous buffers.
If the buffer must be contiguous, you could reserve a large address space and commit it on demand. The implementation depends on the platform. For example, on Win32 the VirtualAlloc function can do that. This gives you a very large contiguous buffer, of which only a portion is allocated (committed). Later you can commit further pages as the buffer needs to grow. This trick requires the concept of virtual memory, which may not be available on a microcontroller.
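A rough Win32 sketch of reserve-then-commit; the sizes here are placeholders:

    #include <windows.h>

    char *reserve_big_buffer(void)
    {
        /* Reserve a large contiguous address range, commit nothing yet. */
        SIZE_T reserved = 256u * 1024 * 1024;     /* 256 MB of address space */
        char *base = VirtualAlloc(NULL, reserved, MEM_RESERVE, PAGE_READWRITE);
        if (base == NULL)
            return NULL;

        /* Commit physical storage for the first 1 MB; commit more later
           as the buffer needs to grow. */
        if (VirtualAlloc(base, 1024 * 1024, MEM_COMMIT, PAGE_READWRITE) == NULL)
            return NULL;

        return base;
    }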
I have an array of integers whose size is known before the kernel launch but not at compile time. The upper bound on the size is around 10000 float3 elements (I guess that means 10000 * 3 * 4 = ~120 KB).
All threads scan linearly through (at most) all of the elements in the array.
You could check the size at runtime, and if it fits, use cudaMemcpyToSymbol; otherwise use texture or global memory. This is slightly messy: you will have to pass some parameter to tell the kernel where the data is. As always, test actual performance; different access patterns can have drastically different speeds in different types of memory.
Another thought is to take a step back and look at the algorithm again. There are often ways of dividing the problem differently to get the constant table to always fit into constant memory.
If all threads in a warp access the same elements at the same time then you should probably consider using constant memory, since this is not only cached, but it also has a broadcast capability whereby all threads can read the same address in a single cycle.
You could calculate the free constant memory after compiling your kernels and allocate the array statically:
__constant__ int c[ALL_I_CAN_ALLOCATE];
Then, copy your data to constant memory using cudaMemcpyToSymbol().
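A hedged host-side sketch combining that with the runtime size check suggested above; h_data, n, and d_data are hypothetical names, and the fallback to global memory is just one option:

    // Host code in the same .cu file as the __constant__ array c above.
    void upload_table(const int *h_data, size_t n)
    {
        size_t bytes = n * sizeof(int);

        if (bytes <= sizeof(c)) {
            // Fits: copy into the statically declared __constant__ array.
            cudaMemcpyToSymbol(c, h_data, bytes, 0, cudaMemcpyHostToDevice);
        } else {
            // Too large: fall back to ordinary global memory and pass
            // the device pointer to the kernel instead.
            int *d_data = NULL;
            cudaMalloc((void **)&d_data, bytes);
            cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        }
    }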
I think this might answer your question, but your constant-memory requirement exceeds the limits of the GPU.
I'd recommend other approaches, e.g. using shared memory, which can broadcast data if all threads in a half-warp read from the same location.