How to use software prefetch systematically? - c

After reading the accepted answer in When should we use prefetch? and examples from Prefetching Examples?, I still have a lot of issues with understanding when to actually use prefetch. While those answers provide an example where prefetch is helpful, they do not explain how to discover it in real programs. It looks like random guessing.
In particular, I am interested in the C implementations for intel x86 (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible through GCC's __builtin_prefetch intrinsic. I would like to know:
How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. number of cache misses) with Intel Vtune or Linux utility perf. In this case what metrics (or relation between them) indicate the opportunity to improve performance with software prefetching?
How I can locate the loads that suffer from cache misses the most?
How to see the cache level where misses happen to decide which prefetch(0,1,2) to use?
Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch? As an example, assume that the next loop suffers from cache misses
for (int i = 0; i < n; i++) {
// some code
double x = a[i];
// some code
}
Should I place prefetch before or after the load a[i]? How far ahead it should point a[i+m]? Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries or it will be almost free like a nop if data is already in cache? Is it worth to use multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

How can I see that software prefetch can help for my specific program?
You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list with perf list for example. The list is dependent of the target processor architecture but there are some generic events. For example, L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. Having the amount of cache misses is not very useful unless you also get the number of load/store. There are generic counters like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2, there is no generic counters (at least on Intel processors) and you need to use specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. A good documentation can be found here.
When the proportion of misses is big, then you should try to optimize the code, but this is just a hint. In fact, regarding your application, you can have a lot of cache hit but many cache misses in critical part/time of your application. As a result, cache misses can be lost among all the others. This is especially true for the L1 cache references that are massive in scalar codes compared to SIMD ones. One solution is to profile only specific portion of your application and use the knowledge of it so to investigate in the good direction. Performance counters are not really a tool to automatically search issues in your program, but a tool to assist you in validating/disproving some hypothesis or to give some hints about what is happening. It gives you evidences to solve a mysterious case but it is up to you, the detective, to do all the work.
How I can locate the loads that suffer from cache misses the most?
Some hardware performance counters are "precise" meaning that the instruction that has generated the event can be located. This is very useful since you can tell which instructions are responsible for the most cache misses (though it is not always precise in practice). You can use perf record + perf report so to get the information (see the previous tutorial for more information).
Note that there are many reasons that can cause a cache misses and only few cases can be solved by using software prefetching.
How to see the cache level where misses happen to decide which prefetch(0,1,2) to use?
This is often difficult to choose in practice and very dependent of your application. Theoretically, the number is an hint to tell to the processor if the level of locality of the target cache line (eg. fetched into the L1, L2 or L3 cache). For example, if you know that data should be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is used and you do not want to pollute it with data used only once (or rarely used), it is better to fetch data into lower caches. In practice, it is a bit complex since the behavior may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
An example of usage is for this question. Software prefetching was used to avoid cache trashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.
Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch?
This is clearly the most tricky part. You should prefetch the cache lines sufficiently early so for the latency to be significantly reduced, otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes some space in the program, need to be decoded, and use load ports that could be used to execute other (more critical) load instructions for example. However, if it is too late, then the cache line can be evicted and need to be reloaded...
The usual solution is to write a code like this:
for (int i = 0; i < n; i++) {
// some code
const size_t magic_distance_guess = 200;
__builtin_prefetch(&data[i+magic_distance_guess]);
double x = a[i];
// some code
}
Where magic_distance_guess is a value generally set based on benchmarks (or a very deep understanding of the target platform though the practice often shows even highly-skilled developers fail to find the best value).
The thing is the latency is very dependent of where data are coming from and the target platform. In most case, developers cannot really know exactly when to do the prefetching unless they work on a unique given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one has to consider the maintainability of the code and the overhead of the instruction). Not to mention that built-ins are compiler-dependent, prefetching intrinsics are architecture-dependent and there is no standard portable way to use software prefetching.
Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries or it will be almost free like a nop if data is already in cache?
Yes, prefetching instructions are not free and so it is better to use only 1 instruction per cache line (as other prefetching instruction on the same cache line will be useless).
Is it worth to use multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
This is very dependent of the target platform. Modern mainstream x86-64 processors execute instructions in an out-of-order way in parallel and they have a pretty huge window of instruction analyzed. They tends to execute load as soon as possible so to avoid misses and they are often very good for such job.
In your example loop, I expect the hardware prefetcher should do a very good job and using software prefetching should be slower on a (relatively recent) mainstream processor.
Software prefetching was useful when hardware prefetchers was not very smart a decade ago but they tends to be very good nowadays. Additionally, it is often better to guide hardware prefetchers than to use software prefetching instructions since the former have a lower overhead. This is why software prefetching is discouraged (eg. by Intel and most developers) unless you really know what you are doing.

How to use software prefetch systematically?
The quick answer is: don't.
As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.
You can use profiling to determine what sections of code form a bottleneck and use specialized tools such as valgrind to try and identify cache misses that could potentially be avoided using compiler builtins.
Don't expect too much from this, but do profile the code to concentrate your optimizing efforts where it can be useful.
Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

Related

large performance drop with gcc, maybe related to inline

I'm currently experiencing some weird effect with gcc (tested version: 4.8.4).
I've got a performance oriented code, which runs pretty fast. Its speed depends for a large part on inlining many small functions.
Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) into a common C file, into which I'm developing a codec, and its associated decoder. It's "relatively" large by my standard (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that, if that is possible.
Encoder and Decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common, except a few typedef and very low-level functions (such as reading from unaligned memory position).
The strange effect is this one:
I recently added a new function fnew to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c file.
The simple fact that it exists makes the performance of the decoder function fdec drops substantially, by more than 20%, which is way too much to be ignored.
Now, keep in mind than encoding and decoding operations are completely separated, and share almost nothing, save some minor typedef (u32, u16 and such) and associated operations (read/write).
When defining the new encoding function fnew as static, performance of the decoder fdec increases back to normal. Since fnew isn't called from the .c, I guess it's the same as if it was not there (dead code elimination).
If static fnew is now called from the encoder side, performance of fdec remains strong.
But as soon as fnew is modified, fdec performance just drops substantially.
Presuming fnew modifications crossed a threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40.) And it worked : performance of fdec is now back to normal.
And I guess this game will continue forever with each little modification of fnew or anything else similar, requiring further tweak.
This is just plain weird. There is no logical reason for some little modification in function fnew to have knock-on effect on completely unrelated function fdec, which only relation is to be in the same file.
The only tentative explanation I could invent so far is that maybe the simple presence of fnew is enough to cross some kind of global file threshold which would impact fdec. fnew can be made "not present" when it's: 1. not there, 2. static but not called from anywhere 3. static and small enough to be inlined. But it's just hiding the problem. Does that mean I can't add any new function?
Really, I couldn't find any satisfying explanation anywhere on the net.
I was curious to know if someone already experienced some equivalent side-effect, and found a solution to it.
[Edit]
Let's go for some more crazy test.
Now I'm adding another completely useless function, just to play with. Its content is strictly exactly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf.
When wtf exists, it doesn't matter if fnew is static or not, nor what is the value of max-inline-insns-auto: performance of fdec is back to normal.
Even though wtf is not used nor called from anywhere... :'(
[Edit 2]
there is no inline instruction. All functions are either normal or static. Inlining decision is solely within compiler's realm, which has worked fine so far.
[Edit 3]
As suggested by Peter Cordes, the issue is not related to inline, but to instruction alignment. On newer Intel cpus (Sandy Bridge and later), hot loop benefit from being aligned on 32-bytes boundaries.
Problem is, by default, gcc align them on 16-bytes boundaries. Which gives a 50% chance to be on proper alignment depending on length of previous code. Hence a difficult to understand issue, which "looks random".
Not all loop are sensitive. It only matters for critical loops, and only if their length make them cross one more 32-bytes instruction segment when being less ideally aligned.
Turning my comments into an answer, because it was turning into a long discussion. Discussion showed that the performance problem is sensitive to alignment.
There are links to some perf-tuning info at https://stackoverflow.com/tags/x86/info, include Intel's optimization guide, and Agner Fog's very excellent stuff. Some of Agner Fog's assembly optimization advice doesn't fully apply to Sandybridge and later CPUs. If you want the low-level details on a specific CPU, though, the microarch guide is very good.
Without at least an external link to code that I can try myself, I can't do more than handwave. If you don't post the code anywher, you're going to need to use profiling / CPU performance counter tools like Linux perf or Intel VTune to track this down in a reasonable amount of time.
In chat, the OP found someone else having this issue, but with code posted. This is probably the same issue the OP is seeing, and is one of the major ways code alignment matters for Sandybridge-style uop caches.
There's a 32B boundary in the middle of the loop in the slow version. The instructions that start before the boundary decode to 5 uops. So in the first cycle, the uop cache serves up mov/add/movzbl/mov. In the 2nd cycle, there's only a single mov uop left in the current cache line. Then the 3rd cycle cycle issues the last 2 uops of the loop: add and cmp+ja.
The problematic mov starts at 0x..ff. I guess instructions that span a 32B boundary go into (one of) the uop cacheline(s) for their starting address.
In the fast version, an iteration only takes 2 cycles to issue: The same first cycle, then mov / add / cmp+ja in the 2nd.
If one of the first 4 instructions had been one byte longer (e.g. padded with a useless prefix, or a REX prefix), there would be no problem. There wouldn't be an odd-man-out at the end of the first cacheline, because the mov would start after the 32B boundary and be part of the next uop cache line.
AFAIK, assemble & check disassembly output is the only way to use longer versions of the same instructions (see Agner Fog's Optimizing Assembly) to get 32B boundaries at multiples of 4 uops. I'm not aware of a GUI that shows alignment of assembled code as you're editing. (And obviously, doing this only works for hand-written asm, and is brittle. Changing the code at all will break the hand-alignment.)
This is why Intel's optimization guide recommends aligning critical loops to 32B.
It would be really cool if an assembler had a way to request that preceding instructions be assembled using longer encodings to pad out to a certain length. Maybe a .startencodealign / .endencodealign 32 pair of directives, to apply padding to code between the directives to make it end on a 32B boundary. This could make terrible code if used badly, though.
Changes to the inlining parameter will change the size of functions, and bump other code over by multiples 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions.
I was expecting the compiler to always make sure a function starts at
ideal aligned position, using noop to fill gaps.
There's a tradeoff. It would hurt performance to align every function to 64B (the start of a cache line). Code density would go down, with more cache lines needed to hold the instructions. 16B is good, because it's the instruction fetch/decode chunk size on most recent CPUs.
Agner Fog has the low-level details for each microarch. He hasn't updated it for Broadwell, though, but the uop cache probably hasn't changed since Sandybridge. I assume there's one fairly small loop that dominates the runtime. I'm not sure exactly what to look for first. Maybe the "slow" version has some branch targets near the end of a 32B block of code (and hence near the end of a uop cacheline), leading to significantly less than 4 uops per clock coming out of the frontend.
Look at performance counters for the "slow" and "fast" versions (e.g. with perf stat ./cmd), and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. (e.g. with perf record ./cmd && perf report on Linux).
How many uops/clock is the "fast" version getting? If it's above 3, frontend bottlenecks (maybe in the uop cache) that are sensitive to alignment could be the issue. Either that or L1 / uop-cache misses if different alignment means your code needs more cache lines than are available.
Anyway, this bears repeating: use a profiler / performance counters to find the new bottleneck that the "slow" version has, but the "fast" version doesn't. Then you can spend time looking at the disassembly of that block of code. (Don't look at gcc's asm output. You need to see the alignment in the disassembly of the final binary.) Look at the 16B and 32B boundaries, since presumably they'll be in different places between the two versions, and we think that's the cause of the problem.
Alignment can also make macro-fusion fail, if a compare/jcc splits a 16B boundary exactly. Although that is unlikely in your case, since your functions are always aligned to some multiple of 16B.
re: automated tools for alignment: no, I'm not aware of anything that can look at a binary and tell you anything useful about alignment. I wish there was an editor to show groups of 4 uops and 32B boundaries alongside your code, and update as you edit.
Intel's IACA can sometimes be useful for analyzing a loop, but IIRC it doesn't know about taken branches, and I think doesn't have a sophisticated model of the frontend, which is obviously the issue if misalignment breaks performance for you.
In my experience, the performance drops may be caused by disabling inlining optimization.
The 'inline' modifier doesn't indicate to force a function to be inlined. It gives compilers a hint to inline a function. So when the compiler's criteria of inlining optimizaion will not be satisfied by trivial modifications of code, a function which is modified with inline is normally compiled to a static function.
And there is a thing make the problem more complex, nested inline optimizations. If you have a inline function, fA, that calls a inline function, fB, like this:
inline void fB(int x, int y) {
return x * y;
}
inline void fA() {
for(int i = 0; i < 0x10000000; ++i) {
fB(i, i+1);
}
}
void main() {
fA();
}
In this case, we expect that both fA and fB are inlined. But if the inlining criteia is not met, the performance can't be predictable. That is, large performance drops occur when inlining is disable about fB, but very slight drops for fA. And you know, compiler's internal decisions are much complex.
The reasons cause disabling inlining, for example, size of inlining function, size of .c file, number of local variables, and so on.
Actually, in C#, I am experienced this performance drops. In my case, 60% performance drop occurs when one local variable is added to a simple inlining function.
EDIT:
You can investigate what happens by reading compiled assembly code. I guess there are unexpected real callings to functions modified with 'inline'.

Is a pointer indirection more costly than a conditional?

Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime that a cache flush might occur, but I don't have any data to back that.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.
This is very much dependant on the circumstances.
1 How often is the data in cache (L1, L2, L3) or and how often it must be fetched all the way from the RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2 What processor is it?
Older Intel Pentium4 were famous for their long pipeline stages, and would take 25-30 clockcycles (~15ns at 2GHz) to "recover" from a branch that was mispredicted.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would thin that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.
It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
It depends from processor to processor, but depending on the set of data you're working with, a pipeline flush caused by a mispredicted branch (or badly ordered instructions in some cases) can be more damaging to the speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while a L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.
Since the processor would have to predict the conditional answer in order to plan which instruction has more chances of having to be executed, I would say that the actual cost of the instructions is not important.
Conditional instructions are bad efficiency wise because they make the process flow unpredictable.

Why instrumented C program runs faster?

I am working on a (quite large) existing monothreaded C application. In this context I modified the application to perform some very few additional work consisting in incrementing a counter each time we call a special function (this function is called ~ 80.000 times). The application is compiled on an Ubuntu 12.04 running a 64 bits Linux kernel 3.2.0-31-generic with -O3 option.
Surprisingly the instrumented version of the code is running faster and I am investigating why.I measure execution time with clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and to get representative results, I am reporting an average execution time value over 100 runs. Moreover, to avoid interference from outside world, I tried as much as possible to launch the application in a system without any other applications running (on a side note, because CLOCK_PROCESS_CPUTIME_ID returns process time and not wall clock time, other applications "should" in theory only affect cache and not directly the process execution time)
I was suspecting "instruction cache effects", maybe the instrumented code that is a little bit larger (few bytes) fits differently and better in the cache, is this hypothesis conceivable ? I tried to do some cache investigations with valegrind --tool=cachegrind but unfortunately, the instrumented version has (as it seems logical) more cache misses than the initial version.
Any hints on this subject and ideas that may help to find why instrumented code is running faster are welcomes (some GCC optimizations available in one case and not in the other, why ?, ...)
Since there are not many details in the question, I can only recommend some factors to consider while investigating the problem.
Very few additional work (such as incrementing a counter) might alter compiler's decision on whether to apply some optimizations or not. Compiler has not always enough information to make perfect choice. It may try to optimize for speed where bottleneck is code size. It may try to auto-vectorize computations when there is not too much data to process. Compiler may not know what kind of data is to be processed or what is the exact model of CPU, that will execute the code.
Incrementing a counter may increase size of some loop and prevent loop unrolling. This may decrease code size (and improve code locality, which is good for instruction or microcode caches or for loop buffer and allows CPU to fetch/decode instructions quickly).
Incrementing a counter may increase size of some function and prevent inlining. This also may decrease code size.
Incrementing a counter may prevent auto-vectorization, which again may decrease code size.
Even if this change does not affect compiler optimization, it may alter the way how the code is executed by CPU.
If you insert counter-incrementing code in place, full of branch targets, this may make branch targets less dense and improve branch prediction.
If you insert counter-incrementing code in front of some particular branch target, this may make branch target's address better aligned and make code fetch faster.
If you place counter-incrementing code after some data is written but before the same data is loaded again (and store-to-load forwarding did not work for some reason), the load operation may be completed earlier.
Insertion of counter-incrementing code may prevent two conflicting load attempts to the same bank in L1 data cache.
Insertion of counter-incrementing code may alter some CPU scheduler decision and make some execution port available just in time for some performance-critical instruction.
To investigate effects of compiler optimization, you can compare generated assembler code before and after addition of counter-incrementing code.
To investigate CPU effects, use a profiler allowing to inspect processor performance counters.
Just guessing from my experience with embedded compilers, Optimization tools in compilers look for recursive tasks. Perhaps the additional code forced the compiler to see something more recursive and it structured the machine code differently. Compilers do some weird things for optimization. In some languages (Perl I think?) a "not not" conditional is faster to execute than a "true" conditional. Does your debugging tool allow you to single step through a code/assembly comparison? This could add some insight as to what the compiler decided to do with the extra tasks.

Prefetching data to cache for x86-64

In my application, at one point I need to perform calculations on a large contiguous block of memory data (100s of MBs). What I was thinking was to keep prefetching the part of the block my program will touch in future, so that when I perform calculations on that portion, the data is already in the cache.
Can someone give me a simple example of how to achieve this with gcc? I read _mm_prefetch somewhere, but don't know how to properly use it. Also note that I have a multicore system, but each core will be working on a different region of memory in parallel.
gcc uses builtin functions as an interface for lowlevel instructions. In particular for your case __builtin_prefetch. But you only should see a measurable difference when using this in cases where the access pattern is not easy to predict automatically.
Modern CPUs have pretty good automatic prefetch and you may well find that you do more harm than good if you try to initiate software prefetching. There is most likely a lot more "low hanging fruit" that you can focus on for optimisation if you find that you actually have a performance problem. Prefetch tends to be one of the last things that you might try, when you're desperate for a few more percent throughput.

Optimizing ARM cache usage for different arrays

I want to port a small piece of code on ARM Cortex A8 processor. Both L1 cache and L2 cache are very limited. There are 3 arrays in my program. Two of them are sequentially accessed(size> Array A: 6MB and Array B: 3MB) and the access pattern for the third array(size> Array C: 3MB) is unpredictable. Though the calculations are not very rigorous but there are huge cache misses for accessing array C. One solution that I thought would be to allocate more cache (L2) space for array C and less for Array A & B. But I'm not able to find any way to achieve this. I went through preload engine of ARM but could not find anything useful.
It would be a good idea to split the cache and allocate each array in a different part of it.
Unfortunately that is not possible. The caches of the CortexA8 just are not that flexible. The good old StrongArm had a secondary cache for exactly this splitting purpose, but it's not available anymore. We have L1 and L2 caches instead (overall a good change imho.)
However, there is a thing you can do:
The NEON SIMD unit of the CortexA8 lags behind the general purpose processing unit by around 10 processor cycles. With clever programming you can issue cache prefetches from the general purpose unit but do the accesses via NEON. The delay between the two pipelines gives the cache a bit of time to do the prefetches, so your average cache miss time will be lower.
The drawback is that if you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind this will cause a full CPU pipeline flush. Almost if not even more costly as a cache miss.
The difference in performance can be significant. Out of the blue I would expect something between 20% and 30% of speed improvement.
From what I could find via Google, it looks like ARMv7 (which is the version of the ISA that Cortex A8 supports) has cache-flush capability, though I couldn't find a clear reference on how to use it -- perhaps you can do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into a search box and reading the results.
In any case, you should be able to achieve an approximation of what you want by periodically issuing "flush" instructions to flush out the parts of A and B that you know you no longer need.

Resources