Combining two buffers into one - c

I need to have two buffers (A and B) and when either of the buffers is full it needs to write its contents to the "merged" buffer - C. Using memcopy seems to be too slow for this operation as noted below in my question. Any insight?'
I haven't tried but I've been told that memcopy will not work. This is an embedded system. 2 buffers. Both of different sizes and when they are full dumb to a common 'C' buffer which is a bigger size than the other two.. Not sure why I got down rated..
Edit: Buffer A and B will be written to prior to C being completely empty.
The memcopy is taking too long and the common buffer 'C' is getting over run.

memcpy is pretty much the fastest way to copy memory. It's frequently a compiler intrinsic and is highly optimized. If it's too slow you're probably going to have to find another way to speed your program up.
I'd expect that copying memory faster is not the lowest hanging fruit in a program.
Some other opportunities could be to copy less memory or copy less often. See if you can profile your program to analyze it's performance and find where the biggest opportunities are.
Edit: With your edit it sounds like the problem is that there's not enough time for you to deal with some data all at once between the time you notice that it needs to be handled and the time that more data comes in. A solution in this case could be, as one of the commenters noted, to have additional buffers that you can flip between. So you may then have time to handle the data in one while another is filled up.

The only way you can merge two buffers without memcpy is by linking them, like a linked list of buffer fragments (or an array of fragments).
Consider that a buffer may not always have to be contiguous. I've done a lot of work with 600dpi images, which means very large buffers. If you can break them up into a sequence of smaller fragments, that helps reducing fragmentation as well as unnecessary copying due to buffer growth.
In some cases buffers must be contiguous, if your API / microcontroller mandates it. For example, Windows bitmap functions require continuity. You could try to use the C realloc function, but it might internally work like the combination of malloc+memcpy+free. Either way, as others have said earlier, memcpy is supposed to be the fastest possible way of copying contiguous buffers.
If the buffer must be contiguous, you could reserve a large address space and commit it on demand. The implementation depends on the platform. For example, on Win32 the VirtualAlloc function can do that. This gives you a very large contiguous buffer, of which only a portion is allocated (committed). Later you can commit further pages as the buffer needs to grow. This trick requires the concept of virtual memory, which may not be available on a microcontroller.

Related

Memory allocation that resizes a buffer ONLY if it can grow in place?

After reading the man-page for realloc(), I came to the realization that it works a little differently than I thought it did. I originally thought that realloc() would attempt to resize a buffer, previously allocated with one of the malloc-family functions, and if it could NOT extend the buffer in place, then it would fail. However, the man-page states:
The realloc() function returns a pointer to the newly allocated memory, which is suitably aligned for any built-in type and may be different from ptr, or NULL if the request fails.
The "may be different from ptr" part is what I'm talking about.
Basically, what I want is a function, similar to realloc(), but which fails if it cannot extend the buffer in place. It seems that there is no function in the standard C library that does this; however, I'm assuming there may be some OS-specific functions that accomplish the same thing.
Could someone tell me what functions are out there that do what I described above, and which OS's they are specific to? Preferably, I'd like to know at least the functions specific to Linux and Windows (and Mac OS would be a nice bonus too :) ).
This may be a duplicate of this post, but I don't think it is for the following reasons:
The question in the post I linked to simply asks, is there a function that extends a buffer in place, whereas, I'm asking, which functions extend a buffer in place.
The accepted answer for that post does not contain the information I need.
EDIT
Some people were wondering what is the use case I need this for, so I'll explain, below:
I'm writing a C preprocessor (yes, I know... don't reinvent the wheel... well, I'm doing it anyways, so there). And one component of the C preprocessor is a cache for storing pp-tokens which come from various source files, where each source file's set of pp-tokens may be fragmented within the cache. The cache itself, is a linked-list of large chunks of memory. Ideally, I'd like to keep this linked-list short, hence why I'd like to first try resizing the buffer (in place); however, if resizing in place is not possible, then I want to just add another node (i.e. chunk of memory) to the linked list.
Within each cache buffer, there are additional linked-list nodes, which provide a means for iterating through all the pp-tokens of each individual source file, which may be fragmented across the various cache buffers that make up the cache.
The reasons I need the kind of memory reallocation I discussed earlier are the following:
If resizing a cache buffer could not be done in place, and a new buffer had to be allocated and the old memory contents copied, then I'd have a lot of dangling pointers. Jonathan Leffler suggested that I instead store offsets within the buffer, rather than pointers, which I had not even thought about, and is a great idea! However, reason #2...
I want the implementation of the cache to be as fast as possible, and, please correct me if I'm wrong, but it seems to me that (for my use case) it would be faster on average to just add a new cache buffer to the linked list if a given cache buffer could not be resized in place, rather than allocating a new buffer and copying all previous contents and freeing the old buffer. As a sidenote, I am planning on doubling the size of the allocated cache buffer each time cache resizing is needed.
Memory management (in the form of malloc and friends) is generally implemented as a library; it is not part of the Operating System. (An implementation of the library will probably need to use some OS facilities to acquire raw memory -- although that's not a given -- but there is no need to involve the OS for allocating and freeing individual allocations.) So you're not going to find an "OS-specific" solution.
There are a number of different memory allocation libraries available. If you decide to use an alternative to the one preinstalled with your particular distribution, you will probably want to arrange for it to be used by the standard library as well. Details for how to do that vary.
Most allocation libraries do include some additional interfaces, but I don't know of any library which offers the function you're looking for. More common is an API for finding out how much memory is actually in an allocation (which is often more than the amount requested by the malloc). For many libraries, realloc will only expand the allocation in place if it was already big enough, but there may be libraries which are willing to merge a following free block in order to make non-copying realloc possible.
There's a list of some commonly-used libraries in the Wikipedia page on dynamic memory allocation, which also has a good overview of implementation techniques.
And, of course, you could always write your own memory manager (or modify an open source library) to implement that feature. However, while that would be an interesting and satisfying project, I'd strongly suggest you think about (and research) the reasons why this seemingly simple idea has not been implemented in common memory management libraries. There are good reasons.

optimal way of using malloc and realloc for dynamic storing

I'm trying to figure out what is the optimal way of using malloc and realloc for recieving unknown amount of characters from the user ,storing them, and printing them only by the end.
I've figured that calling realloc too many times wont be so smart.
so instead, I allocate a set amount of space each time,lets say
sizeof char*100
and by the end of file,i use realloc to fit the size of the whole thing precisely.
what do you think?is this a good way to go about?
would you go in a different path?
Please note,I have no intention of using linked lists,getchar(),putchar().
using malloc and realloc only is a must.
If you realloc to fit the exact amount of data needed, then you are optimizing for memory consumption. This will likely give slower code because 1) you get extra realloc calls and 2) you might not allocate amounts that fit well with CPU alignment and data cache. Possibly this also causes heap segmentation issues because of the repeated reallocs, in which case it could actually waste memory.
It's hard to answer what's "best" generically, but the below method is fairly common, as it is a good compromise between reducing execution speed for realloc calls and lowering memory use:
You allocate a segment, then keep track of how much of this segment that is user data. It is a good idea to allocate size_t mempool_size = n * _Alignof(int); bytes and it is probably also wise to use a n which is divisible by 8.
Each time you run out of free memory in this segment, you realloc to mempool_size*2 bytes. That way you keep doubling the available memory each time.
I've figured that calling realloc too many times wont be so smart.
How have you figured it out? Because the only way to really know is to measure the performance.
Your strategy may need to differ based on how you are reading the data from the user. If you are using getchar() you probably don't want to use realloc() to increase the buffer size by one char each time you read a character. However, a good realloc() will be much less inefficient than you think even in these circumstances. The minimum block size that glibc will actually give you in response to a malloc() is, I think, 16 bytes. So going from 0 to 16 characters and reallocing each time doesn't involve any copying. Similarly for larger reallocations, a new block might not need to be allocated, it may be possible to make the existing block bigger. Don't forget that even at its slowest, realloc() will be faster than a person can type.
Most people don't go for that strategy. What can by typed can be piped so the argument that people don't type very fast doesn't necessarily work. Normally, you introduce the concept of capacity. You allocate a buffer with a certain capacity and when it gets full, you increase its capacity (with realloc()) by adding a new chunk of a certain size. The initial size and the reallocation size can be tuned in various ways. If you are reading user input, you might go for small values e.g. 256 bytes, if you are reading files off disk or across the network, you might go for larger values e.g. 4Kb or bigger.
The increment size doesn't even need to be constant, you could choose to double the size for each needed reallocation. This is the strategy used by some programming libraries. For example the Java implementation of a hash table uses it I believe and so possibly does the Cocoa implementation of an array.
It's impossible to know beforehand what the best strategy in any particular situation is. I would pick something that feels right and then, if the application has performance issues, I would do testing to tune it. Your code doesn't have to be the fastest possible, but only fast enough.
However one thing I absolutely would not do is overlay a home rolled memory algorithm over the top of the built in allocator. If you find yourself maintaining a list of blocks you are not using instead of freeing them, you are doing it wrong. This is what got OpenSSL into trouble.

What is the optimal amount to malloc at a given time when the total needed is not known?

I've implemented a multi-level cache simulator that needs to store the values currently in the simulator. With current configurations, the maximum size of all values being stored could reach 2G. Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front. Instead, I have the program set to allocate memory as needed in chunks. The expense of this allocation is exacerbated by the fact that I'm callocing in order to provide 0 values when no write has occurred previously at the specified location.
My question is, is there a good heuristic for how much memory should be allocated each time more is needed? Currently I'm using an arbitrary value and I considered some solution that would use some ratio of the total system memory (I presume it's possible to dynamically detect this at compile and/or runtime), but even with the latter I'm using an arbitrary ratio with still doesn't sit well with me.
Any insight into best practices for this kind of situation would be appreciated!
A common rule of thumb is to grow geometrically, for example by doubling, on each reallocation.
It's best to understand allocation patterns of your program, if this is a problem you need to optimize for. This comes by understanding the program's implementation, the architecture(s) it runs within, and by observation (e.g. time and memory profiling).
The truth is, you can optimize from many perspectives, but things change over time (inputs change, environments change). In the user-land, your memory usage is already second guessed.
Given your allocation sizes, I assume you are already depending on a system which will default to a backing store as needed. As such, you don't have much control over what is paged or when. Peeking at available physical memory is not worth consideration in this case, and you will have to work hard to do better than the system's existing virtual memory implementation. Several of these systems try to use all available memory (e.g. "Unused RAM is wasted RAM").
Having said that and if those assumptions are correct: It's often better to just reduce your allocation sizes and working sets and do I/O yourself as needed.
Your OSs probably use disk caching as well; reads and writes are probably faster than you suspect for large blocks of memory.
Even deeper: Use virtual memory or memory mapped files for these large data sets. Your kernel will likely handle these cases very well.
Obviously I'm not going to assume this worst case scenario and allocate all of that memory up-front.
Then you will likely be surprised to learn that a 2 GB calloc alone may be better than other alternatives people come up with in some environments because a large calloc could just reserve a domain in virtual memory, loading/initializing pages only when you access them. Depending on your usage, this approach will be much better than some alternatives you may be given.
A good starting point for many problems when understanding a program or input's allocation patterns is to start out conservative, and then make the most beneficial adjustments based on observation. In many cases, you will need little more information than a) accurately determining how much to resize by when resizing is necessary b) reusing allocations where appropriate c) designing your data well for the problem at hand.

Memory consumption of a concurrent data structure in C

I would like to know how much memory a given data structure is consuming. So suppose I have a concurrent linked list. I would like to know how big the list is. I have a few options: malloc_hooks, which I do not think is thread-safe, and getrusage's ru_maxrss, but I don't really know what that gives me (how much memory the whole process consumed during its execution?). I would like to know if someone has actually measured memory consumption in this way. Is there a tool to do this? How does massif fare?
To get an idea of how many bytes it actually costs to malloc some structure, like a linked list node, make an isolated test case(non-concurrent!) which allocates thousands of them, and look at the delta values in the program's memory usage. There are various ways to do that. If your library has a mallinfo structure, like the GNU C Library found on GNU/Linux systems, you can look at the statistics before and after. Another way is to trace the program's system calls to watch its pattern of allocating from the OS. If, say, we allocate 10,000,000 list nodes, and the program performs a sbrk() call about 39,000 times, increasing the size of the process by 8192 bytes in each call, then that means that a list node takes up 32 bytes, overhead and all.
Keeping in mind that allocating thousands of objects of the same size in a single thread does not realistically represent the actual memory usage in a realistic program, which includes fragmentation.
If you want to allocate small structures and come close to not wasting a byte (or not causing any waste that you don't know about and control), and to control fragmentation, then allocate large arrays of the objects from malloc (or your system allocator of choice) and break them up yourself. There is still unknown overhead in the malloc but it is divided over a large number of objects, making it negligible.
Or, generally, write your own allocator whose behavior and overheads you understand in detail, and which itself takes big chunks from the system.
Conceptually speaking you need to know the number of items you are working with. Then you need to know the size of each different data type used in your data structure. You also will have to take into account the size of pointers or anything that is somewhat using some sort of memory.
Then you can come up with a formula that looks like the following:
Consumption= N *(sizeof(data types) ).
So in other words you want to make sure you add any data type together (the data type's size) and multiply it by the number of items.

Is that possible to mmap a very big file and using qsort?

I have to sort a large amount of data that can not fit in memory, and one thing could do this I know is "external sort". But I am wondering is that possible to mmap this large data file, and use 'qsort' as it is a 'normal data array'? If that's feasible, what's the differences with 'external sort'?
If the file will fit in a contiguous mapping in your address space, you can do this. If it won't, you can't.
As to the differences:
if the file just about fits, and then you add some more data, the mmap will fail. A normal external sort won't suddenly stop working because you have a little more data.
if you don't map it with MAP_PRIVATE, sorting will mutate the original file. A normal external sort won't (necessarily)
if you do map it with MAP_PRIVATE, you could crash at any time if the VM doesn't have room to duplicate the whole file. Again, a strictly external sort's memory requirements don't scale linearly with the data size.
tl;dr
It is possible, it may fail unpredictably and unrecoverably, you almost certainly shouldn't do it.
It should definitely work if the data fits in address space (almost certainly does on 64-bit machines; might or might not on 32-bit ones), but performance will depend on a lot on the underlying algorithm used by qsort and its data locality properties. One issue to consider is whether it's the number of elements that's huge, or whether each element is large on disk. In the latter case, you'd be better off doing the mmap, but allocating a separate array of pointers to each element, then sorting the pointer array with a comparison function that compares what they point to. This will drastically reduce the number of times data gets moved around in memory, but it will take a little work at the end if you want to store the output back to the same file.
Yes, this is possible as long as you have fixed-length records in the file and the file fits within a range of contiguous VM addresses, and in fact this can be considered a naive approach to external sorting. It may not be the fastest algorithm in town, though, since qsort implementations will not be tuned for this use case.

Resources