Does GEMM require additional memory to work? - c

Context: I am using LAPACK/BLAS/MKL in C to diagonalize and multiply matrices of side length O(10000).
Normally, diagonalization procedures like zheev and dsyev in LAPACK/MKL require additional memory. This is allocated by passing LWORK=-1 and then allocating the returned value. Most interfaces provide a way to take this extra code away and just pass on the array values, while the explicit interface is available with an extra _work in the name.
However, the documentation for matrix multiplication routines like dgemm and zgemm does not mention any such working space requirements.
Question: I want to know if there is actually a memory requirement, but it is just hidden away, and I haven't found it despite looking for it? Or is there no additional memory requirement?
Purpose: Asking this as I need to know memory requirements in order to run the code on a HPC cluster.

Related

memory allocation/deallocation for embedded devices

Currently we use malloc/free Linux commands for memory allocation/de-allocation in our C based embedded application. I heard that this would cause memory fragmentation as the heap size increases/decreases because of memory allocation/de-allocation which would result in performance degradation. Other programming languages with efficient Garbage Collection solves this issue by freeing the memory when not in use.
Are there any alternate approaches which would solve this issue in C based embedded programs ?
You may take a look at a solution called memory pool allocation.
See: Memory pools implementation in C
Yes, there's an easy solution: don't use dynamic memory allocation outside of initialization.
It is common (in my experience) in embedded systems to only allow calls to malloc when a program starts (this is usually done by convention, there's nothing in C to enforce this. Although you can create your own wrapper for malloc to do this). This requires more work to analyze what memory your program could possibly use since you have to allocate it all at once. The benefit you get, however, is a complete understanding of what memory your program uses.
In some cases this is fairly straightforward, in particular if your system has enough memory to allocate everything it could possibly need all at once. In severely memory-limited systems, however, you're left with the managing the memory yourself. I've seen this done by writing "custom allocators" which you allocate and free memory from. I'll provide an example.
Let's say you're implementing some mathematical program that needs lots of big matrices (not terribly huge, but for example 1000x1000 floats). Your system may not have the memory to allocate many of these matrices, but if you can allocate at least one of them, you could create a pool of memory used for matrix objects, and every time you need a matrix you grab memory from that pool, and when you're done with it you return it to the pool. This is easy if you can return them in the same order you got them in, meaning the memory pool works just like a stack. If this isn't the case, perhaps you could just clear the entire pool at the end of each "iteration" (assuming this math system is periodic).
With more detail about what exactly you're trying to implement I could provide more relevant/specific examples.
Edit: See sg7's answer as well: that user provides a link to well-established frameworks which implement what I describe here.

Finding roots for garbage collection in C

I'm trying to implement a simple mark and sweep garbage collector in C. The first step of the algorithm is finding the roots. So my question is how can I find the roots in a C program?
In the programs using malloc, I'll be using the custom allocator. This custom allocator is all that will be called from the C program, and may be a custom init().
How does garbage collector knows what all the pointers(roots) are in the program? Also, given a pointer of a custom type how does it get all pointers inside that?
For example, if there's a pointer p pointing to a class list, which has another pointer inside it.. say q. How does garbage collector knows about it, so that it can mark it?
Update: How about if I send all the pointer names and types to GC when I init it? Similarly, the structure of different types can also be sent so that GC can traverse the tree. Is this even a sane idea or am I just going crazy?
First off, garbage collectors in C, without extensive compiler and OS support, have to be conservative, because you cannot distinguish between a legitimate pointer and an integer that happens to have a value that looks like a pointer. And even conservative garbage collectors are hard to implement. Like, really hard. And often, you will need to constrain the language in order to get something acceptable: for instance, it might be impossible to correctly collect memory if pointers are hidden or obfuscated. If you allocate 100 bytes and only keep a pointer to the tenth byte of the allocation, your GC is unlikely to figure out that you still need the block since it will see no reference to the beginning. Another very important constraint to control is the memory alignment: if pointers can be on unaligned memory, your collector can be slowed down by a factor of 10x or worse.
To find roots, you need to know where your stacks start, and where your stacks end. Notice the plural form: each thread has its own stack, and you might need to account for that, depending on your objectives. To know where a stack starts, without entering into platform-specific details (that I probably wouldn't be able to provide anyways), you can use assembly code inside the main function of the current thread (just main in a non-threaded executable) to query the stack register (esp on x86, rsp on x86_64 to name those two only). Gcc and clang support a language extension that lets you assign a variable permanently to a register, which should make it easy for you:
register void* stack asm("esp"); // replace esp with the name of your stack reg
(register is a standard language keyword that is most of the time ignored by today's compilers, but coupled with asm("register_name"), it lets you do some nasty stuff.)
To ensure you don't forget important roots, you should defer the actual work of the main function to another one. (On x86 platforms, you can also query ebp/rbp, the stack frame base pointers, instead, and still do your actual work in the main function.)
int main(int argc, const char** argv, const char** envp)
{
register void* stack asm("esp");
// put stack somewhere
return do_main(argc, argv, envp);
}
Once you enter your GC to do collection, you need to query the current stack pointer for the thread you've interrupted. You will need design-specific and/or platform-specific calls for that (though if you get something to execute on the same thread, the technique above will still work).
The actual hunt for roots starts now. Good news: most ABIs will require stack frames to be aligned on a boundary greater than the size of a pointer, which means that if you trust every pointer to be on aligned memory, you can treat your whole stack as a intptr_t* and check if any pattern inside looks like any of your managed pointers.
Obviously, there are other roots. Global variables can (theoretically) be roots, and fields inside structures can be roots too. Registers can also have pointers to objects. You need to separately account for global variables that can be roots (or forbid that altogether, which isn't a bad idea in my opinion) because automatic discovery of those would be hard (at least, I wouldn't know how to do it on any platform).
These roots can lead to references on the heap, where things can go awry if you don't take care.
Since not all platforms provide malloc introspection (as far as I know), you need to implement the concept of scanned memory--that is, memory that your GC knows about. It needs to know at least the address and the size of each of such allocation. When you get a reference to one of these, you simply scan them for pointers, just like you did for the stack. (This means that you should take care that your pointers are aligned. This is normally the case if you let your compiler do its job, but you still need to be careful when you use third-party APIs).
This also means that you cannot put references to collectable memory to places where the GC can't reach it. And this is where it hurts the most and where you need to be extra-careful. Otherwise, if your platform supports mallocĀ introspection, you can easily tell the size of each allocation you get a pointer to and make sure you don't overrun them.
This just scratches the surface of the topic. Garbage collectors are extremely complex, even when single-threaded. When you add threads to the mix, you enter a whole new world of hurt.
Apple has implemented such a conservative GC for the Objective-C language and dubbed it libauto. They have open-sourced it, along with a good part of the low-level technologies of Mac OS X, and you can find the source here.
I can only quote Hot Licks here: good luck!
Okay, before I go even further, I forgot something very important: compiler optimizations can break the GC. If your compiler is not aware of your GC, it can very well never put certain roots on the stack (only dealing with them in registers), and you're going to miss them. This is not too problematic for single-threaded programs if you can inspect registers, but again, a huge mess for multithreaded programs.
Also be very careful about the interruptibility of allocations: you must make sure that your GC cannot kick in while you're returning a new pointer because it could collect it right before it is assigned to a root, and when your program resumes it would assign that new dangling pointer to your program.
And here's an update to address the edit:
Update: How about if I send all the pointer names and types to GC when
I init it? Similarly, the structure of different types can also be
sent so that GC can traverse the tree. Is this even a sane idea or am
I just going crazy?
I guess you could allocate our memory then register it with the GC to tell it that it should be a managed resource. That would solve the interruptability problem. But then, be careful about what you send to third-party libraries, because if they keep a reference to it, your GC might not be able to detect it since they won't register their data structures with your GC.
And you likely won't be able to do that with roots on the stack.
The roots are basically all static and automatic object pointers. Static pointers would be linked inside the load modules. Automatic pointers must be found by scanning stack frames. Of course, you have no idea where in the stack frames the automatic pointers are.
Once you have the roots you need to scan objects and find all the pointers inside them. (This would include pointer arrays.) For that you need to identify the class object and somehow extract from it information about pointer locations. Of course, in C many objects are not virtual and do not have a class pointer within them.
Good luck!!
Added: One technique that could vaguely make your quest possible is "conservative" garbage collection. Since you intend to have your own allocator, you can (somehow) keep track of allocation sizes and locations, so you can pick any pointer-sized chunk out of storage and ask "Might this possibly be a pointer to one of my objects?" You can, of course, never know for sure, since random data might "look like" a pointer to one of your objects, but still you can, through this mechanism, scan a chunk of storage (like a frame in the call stack, or an individual object) and identify all the possible objects it might address.
With a conservative collector you cannot safely do object relocation/compaction (where you modify pointers to objects as you move them) since you might accidentally modify "random" data that looks like an object pointer but is in fact meaningful data to some application. But you can identify unused objects and free up the space they occupy for reuse. With proper design it's possible to have a very effective non-compacting GC.
(However, if your version of C allows unaligned pointers scanning could be very slow, since you'd have to try every variation on byte alignment.)

How to manipulate dynamic multidimensional-array in C

I need to alloc a large multidimensional-array as char a[x][32][y], and x*32*y is about 6~12G. (x, y are detenmined at runtime.)
I think out a way that is to do char *a=malloc(x*32*y), and use *(a+32*y*i+y*j+k) for a[i][j][k].
However, this looks not so convient comparing to a[i][j][k].
Is there any better way ?
Added:
It is a[x][32][datlen], where datlen is detenmined at runtime and x is set considering the memory.
The whole data in the array will be new. And I have got mathines with 16 or 32GB memory to run it.
INCORRECT: You should still be able to use a[i][j][k] syntax when referencing dynamically allocated memory.
CORRECT: Use a macro to at least make the job easier
#define A(i,j,k) *(a+32*y*i+y*j+k)
A(1,2,3) would then do the right thing.
I doubt you'll find a system which will allocate you contiguous memory that large*. You're going to have to utilize a chunking strategy of some kind.
You need to ask, "What is your data access pattern?"
If it is some stride (be it 1D or 2D), use that to choose an appropriate allocation of memory for each chunk. Use a data structure to represent each stride (might just be a struct containing your character arrays).
Edit: I didn't notice your second "question" about accessing your newly found 12G contiguous chunk of memory using a[i][j][k] syntax. That isn't going to happen in any consumer grade C distribution I'm aware of.
(*) and 640k ought to be enough memory for anyone.
Since this is C you cannot wrap everything up into a handy C++ object.
But I would do something similar. Design a series of functions that allocate, manipulate and destroy this new data type of yours.
To read or write a piece of the data, call a function. Never touch the data directly. In fact, if you can use a void* handle to your data and not even put the real data types in an included header file, that's the best thing to do.
With this, you can define the functions as operating on one very large memory block, a set of large memory blocks or even an on-disk database of blocks.
Now that I wrote that, let me partly take it back. If you need more performance you might want to define all of the functions in the included header file as inline definitions. That will let your compiler remove almost all the function call overhead and optimize aggressively.
I admit that matrix_set(x, y, z, value) is not as pretty as matrix[x][y][z] = value, but it will work just as well.

Searching for a safe magic number for in-memory data structure

I'm implementing a heap allocator (malloc), and I need to choose a magic number to check if a given pointer point to a data structure I allocated. It seems obvious to me that no magic number can be considered completely safe (if the number is checked, I can be sure a point to one of my data structure), but maybe I missed something, so... if someone can help and bring me the number of my dreams, I'd really appreciate.
Thx in advance.
It depends on what you're doing this for. If you're doing it to try and catch programming mistakes (e.g., you want to make sure you don't accidentally mix up my_malloc/my_free and malloc/free), then just pick a random value. Sure, sometimes it'll fail to detect such a case, but that really doesn't matter. It shouldn't ever happen. So, here:
#define MAGIC_32BIT 0x77A5844CU
#define MAGIC_64BIT 0xD221A6BE96E04673UL
If correctness depends on this, then you really ought to do this another way. For example, by keeping track of which addresses you've allocated in a hash or tree or, in special cases, a bitmap.
If you're actually implementing malloc/free (e.g., writing your own C library), then keep in mind that freeing something that wasn't malloced (except NULL) is undefined behavior by the standard, so your code doesn't need to worry what happens.
Rather than picking a single magic number, you should use a random number (preferably with at least one of the lower 8 bits set -- you can force this by ORing in 1, for instance) or some constant -- your choice, and then XOR it (^) with an address (e.g., the address you are checking). That approach will dramatically reduce the odds of an accidental collision.
For example, when you write the object header (or page header, depending on the kind of allocator you are writing), store MAGIC ^ addr. Now when you want to check if addr is valid, just see if value == addr ^ MAGIC (with appropriate casts, of course).
By the way, before embarking on creating your own custom memory allocator, please read this paper (Reconsidering Custom Memory Allocation, by Berger, Zorn and McKinley), from OOPSLA 2002.
http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf
Abstract:
Programmers hoping to achieve performance improvements often
use custom memory allocators. This in-depth study examines eight
applications that use custom allocators. Surprisingly, for six of
these applications, a state-of-the-art general-purpose allocator (the
Lea allocator) performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (improvements of up to 44%). Regions also reduce programmer burden and eliminate a source of memory leaks. However, we show that the inability of programmers to free individual
objects within regions can lead to a substantial increase in memory
consumption. Worse, this limitation precludes the use of regions
for common programming idioms, reducing their usefulness.
We present a generalization of general-purpose and region-based
allocators that we call reaps. Reaps are a combination of regions
and heaps, providing a full range of region semantics with the addition of individual object deletion. We show that our implementation of reaps provides high performance, outperforming other allocators with region-like semantics. We then use a case study to
demonstrate the space advantages and software engineering beneļ¬ts of reaps in practice. Our results indicate that programmers
needing fast regions should use reaps, and that most programmers
considering custom allocators should instead use the Lea allocator.
I haven't done such a thing (I worked with heaps but I did notimplemented any allocator) and I'm not certain of what you're trying to do, but maybe this is a case you should use hashing.
Depending on what exactly you're doing, it means to hash the address of the memory chunk, or the data it contains (and every time you change something, this implies recalculating the hash) or some kind of ID of the memory operation.
Again, I am not certain of what you are trying to achieve, then those are my 2 cents.
TALLOC_MAGIC 0xe814ec70
This is from the file talloc.c in the source code here.
Of course, you will have to look at why talloc chose this magic number, but it's a start.

simple c malloc

While there are lots of different sophisticated implementations of malloc / free for C/C++, I'm looking for a really simple and (especially) small one that works on a fixed-size buffer and supports realloc. Thread-safety etc. are not needed and my objects are small and do not vary much in size. Is there any implementation that you could recommend?
EDIT:
I'll use that implementation for a communication buffer at the receiver to transport objects with variable size (unknown to the receiver). The allocated objects won't live long, but there are possibly several objects used at the same time.
As everyone seems to recommend the standard malloc, I should perhaps reformulate my question. What I need is the "simplest" implementation of malloc on top of a buffer that I can start to optimize for my own needs. Perhaps the original question was unclear because I'm not looking for an optimized malloc, only for a simple one. I don't want to start with a glibc-malloc and extend it, but with a light-weight one.
Kerninghan & Ritchie seem to have provided a small malloc / free in their C book - that's exactly what I was looking for (reimplementation found here). I'll only add a simple realloc.
I'd still be glad about suggestions for other implementations that are as simple and concise as this one (for example, using doubly-linked lists).
I recommend the one that came with standard library bundled with your compiler.
One should also note there is no legal way to redefine malloc/free
The malloc/free/realloc that come with your compiler are almost certainly better than some functions you're going to plug in.
It is possible to improve things for fixed-size objects, but that usually doesn't involve trying to replace the malloc but rather supplementing it with memory pools. Typically, you would use malloc to get a large chunk of memory that you can divide into discrete blocks of the appropriate size, and manage those blocks.
It sounds to me that you are looking for a memory pool. The Apache Runtime library has a pretty good one, and it is cross-platform too.
It may not be entirely light-weight, but the source is open and you can modify it.
There's a relatively simple memory pool implementation in CCAN:
http://ccodearchive.net/info/antithread/alloc.html
This looks like fits your bill. Sure, alloc.c is 1230 lines, but a good chunk of that is test code and list manipulation. It's a bit more complex than the code you implemented, but decent memory allocation is complicated.
I would generally not reinvent the wheel with allocation functions unless my memory-usage pattern either is not supported by malloc/etc. or memory can be partitioned into one or more pre-allocated zones, each containing one or two LIFO heaps (freeing any object releases all objects in the same heap that were allocated after it). In a common version of the latter scenario, the only time anything is freed, everything is freed; in such a case, malloc() may be usefully rewritten as:
char *malloc_ptr;
void *malloc(int size)
{
void *ret;
ret = (void*)malloc_ptr;
malloc_ptr += size;
return ret;
}
Zero bytes of overhead per allocated object. An example of a scenario where a custom memory manager was used for a scenario where malloc() was insufficient was an application where variable-length test records produced variable-length result records (which could be longer or shorter); the application needed to support fetching results and adding more tests mid-batch. Tests were stored at increasing addresses starting at the bottom of the buffer, while results were stored at decreasing addresses starting at the top. As a background task, tests after the current one would be copied to the start of the buffer (since there was only one pointer that was used to read tests for processing, the copy logic would update that pointer as required). Had the application used malloc/free, it's possible that the interleaving of allocations for tests and results could have fragmented memory, but with the system used there was no such risk.
Echoing advice to measure first and only specialize if performance sucks - should be easy to abstract your malloc/free/reallocs such that replacement is straightforward.
Given the specialized platform I can't comment on effectiveness of the runtimes. If you do investigate your own then object pooling (see other answers) or small object allocation a la Loki or this is worth a look. The second link has some interesting commentary on the issue as well.

Resources