I'm optimizing shared memory in a CUDA kernel, so I need to determine which variables are the best candidates (that is, most often accessed) to be stored in shared memory. I know I can page through the code and count the number of times each variable is accessed, but the kernel is rather complicated, so I'm hoping there's a way to automate this. Can I have GDB count the number of times each variable is accessed in general CPU code or specifically in cuda-gdb? Or are there other profiling/debugging tools that could be useful?
Thanks.
I'm not aware of any performance metric capable of counting the number of times a specific variable is accessed by the involved threads. Perhaps you should have a look at the disassembled microcode (produced by cuobjdump with the --dump-sass option) to see how many times this actually occurs.
Consider, though, that in modern architectures (e.g., Fermi and Kepler), shared memory can be seen as a "controlled L1 cache", so using it may not be necessary if the variables are not evicted from L1. To get an idea of how frequently global memory variables are evicted from the L1 cache, you can look at some of nvprof's performance metrics rather than counting accesses manually. For example, you may consider global_cache_replay_overhead, gld_efficiency, gst_efficiency, etc. You can find the full list of performance metrics at Metrics Reference.
Finally, as suggested by @talonmies, you may wish to consider using registers instead of shared memory for some frequently used variables, for even faster access.
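As a rough sketch of that last point (plain C here, but the same pattern normally applies inside a CUDA kernel, where a local scalar will usually live in a register; the names are made up for illustration):

/* Hypothetical sketch: copy a frequently accessed value into a local
   variable once, so the compiler can keep it in a register for the hot
   loop, and write it back at the end. */
float accumulate(const float *data, int n, float *shared_or_global_sum)
{
    float sum = *shared_or_global_sum;  /* read the slow location once */
    for (int i = 0; i < n; ++i)
        sum += data[i];                 /* every access hits the local  */
    *shared_or_global_sum = sum;        /* write back once at the end   */
    return sum;
}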
The cache is controlled by the cache hardware transparently to the processor, so if we use volatile variables in a C program, how is it guaranteed that my program reads the data each time from the actual memory address specified and not from the cache?
My understanding is that,
The volatile keyword tells the compiler that accesses to the variable shouldn't be optimized away, and that the variable should be read exactly as written in the code.
The cache is controlled by the cache hardware transparently, so when the processor issues an address, it doesn't know whether the data is coming from the cache or from memory.
So, if I have a requirement to read from a memory address every time it is needed, how can I make sure it is read from the actual address and not from the cache?
Somehow, these two concepts do not fit together well. Please clarify how this is done.
(Assume a write-back cache policy, if that is needed to analyze the problem.)
Thank you,
Microkernel :)
Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. This causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or stored in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register versus a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I will suggest that most hardware registers should be defined as volatile.
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
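As a minimal sketch of the volatile part (the register address and layout below are made up; mapping the region as uncached is the separate, architecture-dependent step described above):

#include <stdint.h>

/* Hypothetical memory-mapped timer register at a made-up address.
   volatile forces a real load on every access; whether that access
   bypasses the cache depends on how the region is mapped, not on the
   keyword itself. */
#define TIMER_COUNT (*(volatile uint32_t *)0x40021000u)

uint32_t elapsed_ticks(void)
{
    uint32_t start = TIMER_COUNT;   /* first read */
    /* ... do some work ... */
    uint32_t now = TIMER_COUNT;     /* second read, not folded into the first */
    return now - start;
}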
Hope that helps. If I've missed something, let me know and I'll expand my answer.
From your question there is a misconception on your part.
The volatile keyword is not related to the cache in the way you describe.
When the keyword volatile is specified for a variable, it tells the compiler not to perform certain optimizations, because the variable can be changed unexpectedly from other parts of the program.
What is meant here is that the compiler should not reuse a value already loaded into a register, but should access memory again, since the value in the register is not guaranteed to be the same as the value stored in memory.
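A small, hypothetical example of the optimization being suppressed:

#include <stdint.h>

volatile uint32_t flag;   /* may be changed elsewhere, e.g. by an interrupt handler */

void wait_for_flag(void)
{
    /* Without volatile, the compiler could load 'flag' once into a
       register and spin on that stale copy forever.  With volatile,
       each iteration performs a fresh load from memory. */
    while (flag == 0) {
        /* busy-wait */
    }
}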
The rest, concerning the cache memory, is not directly visible to the programmer.
I mean that the synchronization of the CPU's cache memory with RAM is an entirely different subject.
My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done through setting PAGE_NOCACHE when calling VirtualProtect.
For a somewhat different purpose, SSE2 has the _mm_stream_xyz intrinsics to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.
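As a minimal sketch (error handling omitted; the answer mentions VirtualProtect, but this example uses VirtualAlloc, where the PAGE_NOCACHE flag can be applied at allocation time):

#include <windows.h>

/* Sketch: allocate private memory marked non-cached.  PAGE_NOCACHE is
   combined with the ordinary protection flags. */
void *alloc_uncached(SIZE_T size)
{
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                        PAGE_READWRITE | PAGE_NOCACHE);
}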
Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize it, starting with the Pentium Pro, Intel (and AMD, which copied the feature) had these MTRRs, which can set uncached, write-through, write-combining, write-protect, or write-back attributes on ranges of memory.
Starting with the Pentium III (though, as far as I know, only really useful with the 64-bit processors), the MTRRs are still honored, but they can be overridden by the Page Attribute Table (PAT), which lets the CPU set a memory type for each individual page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining: this lets the CPU buffer up the writes and relaxes all of the memory write-ordering rules, allowing very high-speed burst writes to a graphics card.
But for your purposes you would want either an MTRR or a PAT setting of either uncached or write-through.
As you say, the cache is transparent to the programmer. The system guarantees that you always see the value that was last written if you access an object through its address. The "only" thing you may incur if an obsolete value is in your cache is a runtime penalty.
volatile makes sure that data is read every time it is needed, without the compiler keeping a copy in a register; but it says nothing about any cache sitting between the CPU and memory. If you need to read the actual data from memory and not cached data, you have two options:
Use a board/memory map where said data is not cached. This may already be the case if you address some I/O device.
Use specific CPU instructions that bypass the cache. This is used, for example, when you need to scrub memory to catch possible SEU (single-event upset) errors.
The details of the second option depend on the OS and/or CPU.
Using the _Uncached keyword may help in an embedded OS like MQX:
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr,data) (*((volatile _Uncached unsigned int *)(addr)) = data)
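Usage might look like this (the register address and bit are made-up examples):

#define DEVICE_STATUS_ADDR 0x2000F000u   /* hypothetical device register */

void enable_device(void)
{
    unsigned int status = MEM_READ(DEVICE_STATUS_ADDR);
    MEM_WRITE(DEVICE_STATUS_ADDR, status | 0x1u);   /* set the hypothetical enable bit */
}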
I am new to computer programming. I was studying variables and came across a definition on the internet:
Variables are the names you give to computer memory locations which are used to store values in a computer program.
What are these memory locations? Do these locations refer to the actual computer memory, or is this just some kind of dump inside the program itself from which it retrieves those variables later when we need them?
Also, there are other terms that I encountered here on Stack Overflow, like heap and stack. I could not get my head around these. Please help.
The way you've asked the question suggests you expect a single answer. That is simply not the case.
In a rough sense, all variables will exist in memory while your program is being executed. The memory your variables exist in depends on several things.
Modern computer hardware often has quite a complex physical memory architecture - with multiple levels of cache (in both the CPU, and various peripheral devices), a number of CPU registers, shared memory, different types of RAM, storage devices, EEPROMs, etc. Different systems have these types of memory - and more types - in different proportions.
Operating systems may make memory available to your program in different ways. For example, it may provide virtual memory, using a combination of RAM and reserved hard drive space (and managing mappings, so your program can't tell the difference). This can allow your program to use more memory than is physically available as RAM, but also affects performance, since the operating system must swap memory usage of your program between RAM and the hard drive (which is typically orders of magnitude slower).
A lot of compilers and libraries are implemented to maximise your program's performance (by various measures): compiler optimisation of your code (which can cause some variables in your code to not even exist when your program is run), library functions crafted for performance, etc. One consequence of this is that the compiler, or library, may use memory in different ways (e.g., some implementations may embed code in your executable to detect the memory resources available when the program is run, others may simply assume a fixed amount of RAM), and the usage may even vary over time.
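To make the terms from the question a little more concrete, here is a small sketch of the three usual storage categories in C (the names are only for illustration):

#include <stdlib.h>

int global_count;                 /* static storage: exists for the whole run */

void example(void)
{
    int local = 0;                /* automatic storage: lives on the stack,
                                     created and destroyed with each call     */
    int *heap = malloc(sizeof *heap);   /* dynamic storage: lives on the heap
                                           until free() is called             */
    if (heap) {
        *heap = local + global_count;
        free(heap);
    }
}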
On the Intel x86 platform running Linux, in C/C++, how can I tell the OS and the hardware to store a value (such as a uint32) in L1/L2 cache, and not in system memory? For example, let's say either for security or performance reasons, I don't want to store a 32-bit key (a 32-bit unsigned int) in DRAM, and instead I would like to store it only in the processor's cache. How can I do this? I'm using Fedora 16 (Linux 3.1 and gcc 4.6.2) on an Intel Xeon processor.
Many thanks in advance for your help!
I don't think you are able to force a variable to be stored in the processor's cache, but you can use the register keyword to suggest to the compiler that a given variable should be allocated into a CPU register, declaring it like:
register int i;
There are no CPU instructions on x86 (or indeed any platform that I'm aware of) that will allow you to force the CPU to keep something in the L1/L2 cache, let alone a way to expose such an extremely low-level detail to higher-level languages like C/C++.
Saying you need to do this for "performance" is meaningless without more context about what sort of performance you're looking at. Why is your program so tightly dependent on having access to data in the cache alone? Saying you need this for security just seems like bad security design. In either case, you have to provide a lot more detail about what exactly you're trying to do here.
Short answer: you can't. That is not what those caches are for - they are fed from main memory to speed up access, or to allow for advanced techniques like branch prediction and pipelining.
There are ways to ensure the caches are used for certain data, but it will still reside in RAM, and in a pre-emptive multitasking operating system you cannot guarantee that your cache contents will not be blown away by a context switch between any two instructions. The exceptions are 'stopping the world' or low-level atomic operations, but those are generally for very, very, very short sequences of instructions that simply cannot be interrupted, like increment-and-fetch for spinlocks, not for processing cryptographic algorithms in one go.
You can't use cache directly but you can use hardware registers for integers, and they are faster.
If you really want performance, a variable is better off in a CPU register.
If you cannot use a register, for example because you need to share the same value across different threads or cores (multicore is getting common now!), you need to store the variable into memory.
As already mentioned, you cannot force some memory into the cache using a call or keyword.
However, caches aren't entirely stupid: if your memory block is used often enough, you shouldn't have a problem keeping it in the cache.
Keep in mind that if you happen to write to this memory location a lot from different cores, you're going to strain the cache coherency logic in the processor, because it needs to make sure that all the caches and the actual memory below are kept in sync.
Put simply, this would reduce overall performance of the CPU.
Note that the opposite (do not cache) does exist as a property you can assign to parts of your heap memory.
My function will be called thousands of times. If I want to make it faster, will changing the local function variables to static be of any use? My logic behind this is that, because static variables are persistent between function calls, they are allocated only the first time, and thus every subsequent call will not allocate memory for them and will become faster, because the memory allocation step is not done.
Also, if the above is true, then would using global variables instead of parameters be faster for passing information to the function every time it is called? I think space for parameters is also allocated on every function call, to allow for recursion (that's why recursion uses up more memory), but since my function is not recursive, and if my reasoning is correct, then removing the parameters will in theory make it faster.
I know these things I want to do are horrible programming habits, but please, tell me if it is wise. I am going to try it anyway but please give me your opinion.
The overhead of local variables is zero. Each time you call a function, you are already setting up the stack for the parameters, return values, etc. Adding local variables means that you're adding a slightly bigger number to the stack pointer (a number which is computed at compile time).
Also, local variables are probably faster due to cache locality.
If you are only calling your function "thousands" of times (not millions or billions), then you should be looking at your algorithm for optimization opportunities after you have run a profiler.
Re: cache locality:
Frequently accessed global variables probably have temporal locality. They also may be copied to a register during function execution, but will be written back into memory (cache) after a function returns (otherwise they wouldn't be accessible to anything else; registers don't have addresses).
Local variables will generally have both temporal and spatial locality (they get that by virtue of being created on the stack). Additionally, they may be "allocated" directly to registers and never be written to memory.
The best way to find out is to actually run a profiler. This can be as simple as executing several timed tests using both methods and then averaging out the results and comparing, or you may consider a full-blown profiling tool which attaches itself to a process and graphs out memory use over time and execution speed.
Do not perform random micro code-tuning because you have a gut feeling it will be faster. Compilers all have slightly different implementations of things and what is true on one compiler on one environment may be false on another configuration.
To tackle that comment about fewer parameters: the process of "inlining" functions essentially removes the overhead related to calling a function. Chances are a small function will be automatically in-lined by the compiler, but you can suggest a function be inlined as well.
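For example, in C you might hint at inlining like this (whether the compiler honors the hint is entirely up to the compiler; the function itself is just a toy):

/* 'static inline' is only a suggestion; modern compilers will usually
   inline a function this small on their own. */
static inline int clamp(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}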
In a different language, C++, the upcoming standard supports perfect forwarding and move semantics with rvalue references, which removes the need for temporaries in certain cases and can reduce the cost of calling a function.
I suspect you're optimizing prematurely; you should not be this concerned with performance until you've discovered your real bottlenecks.
Absolutely not! The only "performance" difference is in when the variables are initialised:
int anint = 42;
vs
static int anint = 42;
In the first case the integer will be set to 42 every time the function is called; in the second case it will be set to 42 when the program is loaded.
However, the difference is so trivial as to be barely noticeable. It's a common misconception that storage has to be allocated for "automatic" variables on every call. This is not so: C uses the already-allocated space on the stack for these variables.
Static variables may actually slow you down, as some aggressive optimisations are not possible on static variables. Also, as locals are in a contiguous area of the stack, they are easier to cache efficiently.
There is no one answer to this. It will vary with the CPU, the compiler, the compiler flags, the number of local variables you have, what the CPU's been doing before you call the function, and quite possibly the phase of the moon.
Consider two extremes: if you have only one or a few local variables, they might easily be stored in registers rather than being allocated memory locations at all. If register "pressure" is sufficiently low, this may happen without executing any instructions at all.
At the opposite extreme there are a few machines (e.g., IBM mainframes) that don't have stacks at all. In this case, what we'd normally think of as stack frames are actually allocated as a linked list on the heap. As you'd probably guess, this can be quite slow.
When it comes to accessing the variables, the situation is somewhat similar -- access to a machine register is pretty well guaranteed to be faster than anything allocated in memory can possibly hope for. OTOH, it's possible for access to variables on the stack to be pretty slow -- it normally requires something like an indexed indirect access, which (especially with older CPUs) tends to be fairly slow. OTOH, access to a global (which a static is, even though its name isn't globally visible) typically requires forming an absolute address, which some CPUs penalize to some degree as well.
Bottom line: even the advice to profile your code may be misplaced -- the difference may easily be so tiny that even a profiler won't detect it dependably, and the only way to be sure is to examine the assembly language that's produced (and spend a few years learning assembly language well enough to say anything meaningful when you do look at it). The other side of this is that when you're dealing with a difference you can't even measure dependably, the chances that it'll have a material effect on the speed of real code are so remote that it's probably not worth the trouble.
It looks like the static vs non-static question has been completely covered, but on the topic of global variables: often these will slow down a program's execution rather than speed it up.
The reason is that tightly-scoped variables make it easy for the compiler to heavily optimise; if the compiler has to look all over your application for instances where a global might be used, then its optimising won't be as good.
This is compounded when you introduce pointers. Say you have the following code:
/* SomeStruct and FillOutSomeStruct are defined elsewhere; memcpy is from <string.h> */
int myFunction()
{
    SomeStruct a, b;
    SomeStruct *A = &a, *B = &b;

    FillOutSomeStruct(B);
    memcpy(A, B, sizeof(*A));
    return A->result;
}
The compiler knows that the pointers A and B can never overlap, and so it can optimise the copy. If A and B are global then they could possibly point to overlapping or identical memory; this means the compiler must 'play it safe', which is slower. The problem is generally called 'pointer aliasing' and can occur in lots of situations, not just memory copies.
http://en.wikipedia.org/wiki/Pointer_alias
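One standard tool aimed at exactly this problem (not mentioned in the answer above) is C99's restrict qualifier, by which the programmer promises the compiler that two pointers do not alias; a minimal sketch:

#include <stddef.h>

/* With 'restrict' the caller promises that dst and src never overlap,
   so the compiler is free to reorder or vectorize the copy loop. */
void copy_ints(int * restrict dst, const int * restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}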
Using static variables may make a function a tiny bit faster. However, this will cause problems if you ever want to make your program multi-threaded. Since static variables are shared between function invocations, invoking the function simultaneously in different threads will result in undefined behaviour. Multi-threading is the type of thing you may want to do in the future to really speed up your code.
Most of the things you mentioned are referred to as micro-optimizations. Generally, worrying about these kind of things is a bad idea. It makes your code harder to read, and harder to maintain. It's also highly likely to introduce bugs. You'll likely get more bang for your buck doing optimizations at a higher level.
As M2tM suggests, running a profiler is also a good idea. Check out gprof for one which is quite easy to use.
You can always time your application to truly determine what is fastest. Here is what I understand: (all of this depends on the architecture of your processor, btw)
C functions create a stack frame, which is where passed parameters are put, and local variables are put, as well as the return pointer back to where the caller called the function. There is no memory management allocation here. It's usually a simple pointer adjustment and that's it. Accessing data off the stack is also pretty quick. Penalties usually come into play when you're dealing with pointers.
As for global or static variables, they're the same... from the standpoint that they're going to be allocated in the same region of memory. Accessing these may use a different method of access than local variables; it depends on the compiler.
The major difference between your scenarios is memory footprint, not so much speed.
Using static variables can actually make your code significantly slower. Static variables must exist in a 'data' region of memory. In order to use that variable, the function must execute a load instruction to read from main memory, or a store instruction to write to it. If that region is not in the cache, you lose many cycles. A local variable that lives on the stack will most surely have an address that is in the cache, and might even be in a cpu register, never appearing in memory at all.
I agree with the others' comments about profiling to find out stuff like that, but generally speaking, function static variables should be slower. If you want them, what you are really after is a global. Function statics insert code/data to check whether the variable has already been initialized, and that check gets run every time your function is called.
Profiling may not see the difference, disassembling and knowing what to look for might.
I suspect you are only going to get a variation of as much as a few clock cycles per loop (on average, depending on the compiler, etc.). Sometimes the change will be a dramatic improvement or dramatically slower, and that won't necessarily be because the variable's home has moved to/from the stack. Let's say you save four clock cycles per function call for 10000 calls on a 2 GHz processor. Very rough calculation: 20 microseconds saved. Is 20 microseconds a lot or a little compared to your current execution time?
You will likely get more of a performance improvement by making all of your char and short variables into ints, among other things. Micro-optimization is a good thing to know, but it takes lots of time experimenting, disassembling, and timing the execution of your code, and understanding that fewer instructions does not necessarily mean faster, for example.
Take your specific program, disassemble both the function in question and the code that calls it. With and without the static. If you gain only one or two instructions and this is the only optimization you are going to do, it is probably not worth it. You may not be able to see the difference while profiling. Changes in where the cache lines hit could show up in profiling before changes in the code for example.
I just had a phone interview where I was asked this question. I am aware of ways to store in register or heap or stack, but cache specifically?
Not in C as a language. In GCC as a compiler - look for __builtin_prefetch.
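A hedged sketch of what that GCC builtin looks like in use (the look-ahead distance of 16 elements is an arbitrary illustration, not a tuned value):

/* GCC-specific: hint that data needed soon should be pulled toward the
   cache.  The second argument is 0 for a read, the third is a locality
   hint from 0 (no reuse expected) to 3 (high reuse). */
long sum_with_prefetch(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        sum += a[i];
    }
    return sum;
}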
You might be interested in reading What every programmer should know about memory.
Edit:
Just to clear up some confusion: caches are physically separate memories in hardware, but not in the software abstraction of the machine. A word in a cache is always associated with an address in main memory. This is different from the CPU registers, which are named/addressed separately from the RAM.
In C, as in as defined by the C standard? No.
In C, as in some specific implementation on a specific platform? Maybe.
As cache is a CPU concept and is meaningless to the C language (and C also targets processors that have no cache at all - unlikely today, but quite common in the old days), the answer is definitely no.
Trying to optimize such things by hand is also usually a quite bad idea.
What you can do is keep the job easy for the compiler: keep loops very short and doing only one thing (good for the instruction cache), iterate over memory blocks in the right order (prefer accesses to consecutive cells in memory over sparse accesses), avoid reusing the same variables for different uses (it introduces read-after-write dependencies), etc. If you are attentive to such details, the program is more likely to be efficiently optimized by the compiler and memory accesses are more likely to be served from the cache.
But it will still depend on the actual hardware, and even the compiler cannot guarantee it.
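As an illustration of the 'right order' point, iterating a 2-D array row by row touches consecutive memory cells, while iterating column by column jumps across memory; a small sketch (the array sizes are arbitrary):

#define ROWS 1024
#define COLS 1024

long sum_row_major(int m[ROWS][COLS])
{
    long sum = 0;
    /* Row-major traversal: consecutive accesses touch consecutive
       addresses, so each cache line that is fetched gets fully used. */
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}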
It depends on the platform, so if you were speaking to a company targeting current-generation consoles, you would need to know the PowerPC data cache intrinsics/instructions. On various platforms, you would also need to know the false sharing rules. Also, you can't cache from memory marked explicitly as uncached.
Without more context about the actual job or company or question, this would probably be best answered by talking about what not to do to keep memory references in the data cache.
If you are trying to force something to be stored in the CPU cache, I would recommend that you avoid trying to do so unless you have an overwhelmingly good reason. Manually manipulating the CPU cache can have all sorts of unintended consequences, not the least among them being coherency in multi-core or multi-CPU applications. This is something that is done by the CPU at run-time and is generally transparent to the programmer and the compiler for a good reason.
The specific answer will depend on your compiler and platform. If you are targeting a MIPS architecture, there is a CACHE instruction (assembly) which allows you to do CPU cache manipulations.