I'm writing because I am currently working on a project using the gem5 simulator to simulate an ARM big.LITTLE configuration in which the big CPU cluster has an L2 cache but the little CPU cluster does not. That is, I would like to simulate a system in which the little cores are even simpler than their default configuration. I am running the project using the full-system big.LITTLE config file (i.e., gem5/configs/example/fs_bigLITTLE.py). Is it possible to configure the system in this way?
My initial thought was to modify the python file so that the little cluster configuration is composed of the following:
class LittleCluster(devices.CpuCluster):
    def __init__(self, system, num_cpus, cpu_clock, cpu_voltage="1.0V"):
        cpu_config = [ObjectList.cpu_list.get("MinorCPU"), devices.L1I,
                      devices.L1D, devices.WalkCache, None]
        super(LittleCluster, self).__init__(system, num_cpus, cpu_clock,
                                            cpu_voltage, *cpu_config)
or, in layman's terms, provide None as the SimObject class name for the L2. Unfortunately, as one might expect, this causes the system to crash as gem5 expects an object to connect the ports.
My next idea was to write a new SimObject called EmptyCache that inherits gem5's Cache class but does nothing: on every access it returns false, and it is configured to have no tag, data, or response latency. However, this caused coherence issues with the L1 caches in the little cluster, so I then changed it to evict any cache block that it "hits" before returning false (the following is based on a prior post to the gem5-users mailing list: https://www.mail-archive.com/gem5-users#gem5.org/msg16882.html):
bool
EmptyCache::access(PacketPtr pkt, CacheBlk *&blk, Cycles &lat,
                   PacketList &writebacks)
{
    // If the lookup in the underlying Cache hits, immediately evict the
    // block so nothing is ever retained here, then report a miss.
    if (Cache::access(pkt, blk, lat, writebacks)) {
        Cache::evictBlock(blk);
    }
    return false;
}
This seems to solve the coherence issues with the L1 caches, but it then caused coherence issues at the memory bus (a SystemXBar, which derives from CoherentXBar).
At this point, I feel pretty much stuck and would appreciate any advice that you could provide! Thank you!
Edit: I've continued to try to make headway with this project despite the setbacks by modifying the gem5/configs/example/arm/devices.py file. One approach I noted would be to give each core in the little (MinorCPU) cluster private L1 caches and connect that cluster directly to the memory bus, while the big cluster keeps its full cache hierarchy; the problem in this direction is the incoherence between packets from the little and big clusters.
Namely, there are several assertions and panic_if statements in the caches that expect a particular MemCmd value and/or that the packet "hasSharers". Naturally, I assume this is caused by my simulation setup, because the simulator runs fine without it, but is there a way to configure the simulator in this direction so that there is some semblance of coherence between the big and little CPU clusters?
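For concreteness, the wiring I have been experimenting with looks roughly like the sketch below. This is only a sketch: it assumes an older gem5 with the .slave port convention and the BaseCPU.addPrivateSplitL1Caches()/connectAllPorts() helpers, and littleCluster and system.membus are placeholders for whatever the objects are actually called in fs_bigLITTLE.py.
# Sketch: give each little core private L1s and hang them straight off the
# memory bus, skipping the L2 entirely. littleCluster and system.membus are
# placeholder names, not the actual identifiers in fs_bigLITTLE.py.
for cpu in littleCluster.cpus:
    cpu.addPrivateSplitL1Caches(devices.L1I(), devices.L1D())
    cpu.connectAllPorts(system.membus)
# The big cluster keeps its private L1s and shared L2, with the L2's
# mem_side port connected to the same membus as before.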
Again, thank you for your help!
Let's say you are working on a big Flink project, and you keyBy the client IP addresses of your customers.
You then realize that you are filtering on the same thing in different places in the code, like this:
public void calculationOne(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processB).sink(...);
}
Assume there are many occurrences of kafkaSource.filter(isContainsSmthA).
Does this structure lead to a performance issue in Flink?
Would it be much better if I did something like the below?
public Stream filteredA(){
return kafkaSource.filter(isContainsSmthA);
}
public void calculationOne(){
filteredA().keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
filteredA().keyBy(clientip).process(processB).sink(...);
}
It depends a bit on how it should behave operationally.
The first way is more friendly to the Kafka cluster: all records are read once. The filter itself is a very cheap operation, so you don't need to worry too much about it. However, the big downside of this approach is that if one calculation is much slower than the others, it will slow them all down. If you do not process historic events, it shouldn't matter, as you'd size your application cluster to keep up with all events anyway. Another current downside is that if you have a failure in calculationTwo, tasks in calculationOne are also restarted. The community is actively working to mitigate that, though.
The second way allows only the affected source -> ... -> sink subtopology to be restarted. So if you expect frequent restarts or need to guarantee certain SLAs, this approach is better. An extension is to have completely separate Flink applications for each of these pipelines. You can share the same jar but use different arguments to select the correct pipeline on submission. This approach also makes updating applications much easier, as you only experience downtime for the pipeline that you actually modify.
I might do something like below, where a simple wrapper operator can run data through two different functions, and generate two side outputs.
SingleOutputStreamOperator comboResults = kafkaSource
.filter(isContainsSmthA)
.keyBy(clientip)
.process(new MyWrapperFunction(processA, processB));
comboResults
.getSideOutput(processATag)
.sink(...);
comboResults
.getSideOutput(processBTag)
.sink(...);
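For reference, MyWrapperFunction could be sketched roughly as below. This is illustrative only: the element and result types are assumptions, and the two computations are shown as plain private methods rather than full ProcessFunctions, since one ProcessFunction cannot easily invoke another.
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Runs two per-record computations inside one keyed operator and emits each
// result to its own side output, so the two sinks stay independent downstream.
public class MyWrapperFunction extends ProcessFunction<String, Void> {

    public static final OutputTag<String> processATag =
            new OutputTag<String>("process-a") {};
    public static final OutputTag<String> processBTag =
            new OutputTag<String>("process-b") {};

    @Override
    public void processElement(String value, Context ctx, Collector<Void> out) {
        // Both computations see the same keyed record; each result is routed
        // to its own side output.
        ctx.output(processATag, computeA(value));
        ctx.output(processBTag, computeB(value));
    }

    private String computeA(String value) { return "A:" + value; }

    private String computeB(String value) { return "B:" + value; }
}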
Though I don't know how that compares with what Arvid suggested.
For the last few weeks, I have been trying to deal with an intermittent problem on a Camel route that does XSLT processing following an aggregation. It is intermittent in the sense that while it frequently raises this exception, I can re-run the data extract and processing that failed a few seconds later and it usually succeeds. I have yet to find any data that fails consistently.
I am assuming that the aggregation is causing the problem, but I can't for the life of me understand why. I thought it might be the custom aggregation bean I was using, so I replaced it with XSLTAggregationStrategy, but it still intermittently gives this issue, either when further transforming the aggregated XML or when just writing it out to a file.
This is executing in an Apache-Karaf environment, and I have Camel-Saxon 2.21.2 and Apache ServiceMix Saxon-HE 9.8.0.8_1 bundles loaded.
Thanks for looking.
The abridged stack trace is:
...
Caused by: [java.lang.NullPointerException -
null]java.lang.NullPointerException
at net.sf.saxon.dom.DOMNodeWrapper$ChildEnumeration.skipFollowingTextNodes(DOMNodeWrapper.java:1149)
at net.sf.saxon.dom.DOMNodeWrapper$ChildEnumeration.next(DOMNodeWrapper.java:1178)
at net.sf.saxon.tree.util.Navigator$EmptyTextFilter.next(Navigator.java:1078)
at net.sf.saxon.tree.util.Navigator$AxisFilter.next(Navigator.java:1039)
at net.sf.saxon.tree.util.Navigator$AxisFilter.next(Navigator.java:1017)
at net.sf.saxon.expr.parser.ExpressionTool.effectiveBooleanValue(ExpressionTool.java:643)
at net.sf.saxon.expr.Expression.effectiveBooleanValue(Expression.java:532)
at net.sf.saxon.pattern.PatternWithPredicate.matches(PatternWithPredicate.java:141)
at net.sf.saxon.trans.Mode.searchRuleChain(Mode.java:570)
at net.sf.saxon.trans.Mode.getRule(Mode.java:476)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1041)
at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:281)
at net.sf.saxon.expr.instruct.ApplyTemplates.processLeavingTail(ApplyTemplates.java:241)
at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1057)
at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:281)
at net.sf.saxon.expr.instruct.ApplyTemplates.process(ApplyTemplates.java:237)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:431)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:373)
at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1057)
at net.sf.saxon.Controller.transformDocument(Controller.java:2080)
at net.sf.saxon.Controller.transform(Controller.java:1903)
at org.apache.camel.builder.xml.XsltBuilder.process(XsltBuilder.java:141)
at org.apache.camel.impl.ProcessorEndpoint.onExchange(ProcessorEndpoint.java:103)
at org.apache.camel.component.xslt.XsltEndpoint.onExchange(XsltEndpoint.java:138) ...
In 9.8.0.8, the class net.sf.saxon.dom.DOMNodeWrapper has only 1144 lines, so a stacktrace showing line 1178 suggests there's some kind of versioning problem.
The class DOMNodeWrapper was first introduced in 9.5 (previously it was called NodeWrapper), and the line numbers are just one off from those in the current 9.5 source, so I suspect what you have loaded is some sub-release of the 9.5 branch. Other line numbers in the stack trace are also consistent with this being 9.5.
That of course doesn't explain the problem, but it might give a clue.
My immediate instinct was that over the years since 9.5 we might have fixed a multi-threading bug. DOM is not thread-safe, so Saxon takes considerable care to synchronize its access. Saxon bug https://saxonica.plan.io/issues/2376 addresses this problem. On the 9.5 branch this was first fixed in maintenance release 9.5.1.11, so it's possible you don't have that patch. I think it would be useful to investigate why you are loading an old version of Saxon, and another useful angle would be to discover exactly which version it is (the static method net.sf.saxon.Version.getProductVersion() will give you this information.)
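For example, a one-liner dropped into a bean or processor will tell you which Saxon build the classloader actually picked up:
// Prints the Saxon version that is actually loaded at runtime.
System.out.println("Saxon version: " + net.sf.saxon.Version.getProductVersion());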
Incidentally, if you are using multi-threaded access to a DOM tree then you should ask yourself whether this is a good idea. Saxon access to DOM is slow at the best of times (compared to JDOM and XOM, let alone to Saxon's native tree model), and the lack of thread safety and the need for synchronisation makes it a pretty poor choice in a multi-threaded application.
Also, note that Saxon can synchronize its own access to the DOM, but it can't synchronize with third-party code that might also be using the DOM.
Greetings!
I've got a Google App Engine Setup where memcached keys are prefixed with os.environ['CURRENT_VERSION_ID'] in order to produce a new cache on deploy, without having to flush the cache manually.
This was working just fine until it became necessary for development to run two versions of the application at the same time. This, of course, is yielding inconsistencies in caching.
I'm looking for suggestions as to how to prefix the keys now. Essentially, there needs to be a variable that changes across versions when any version is deployed. (Well, this isn't quite ideal, as the cache gets totally blown out.)
I was thinking of the following possibilities:
Make a RuntimeEnvironment entity that stores the latest cache prefix. Drawbacks: even if cached, it slows down every request. It cannot be cached in memory, only in memcached, as deployment of another version may change it.
Use a per-entity version number. This yields very nice granularity in that the cache can stay warm for non-modified entities. The downside is we'd need to push to all versions when models are changed, which I want to avoid in order to test model changes out before deploying to production.
Forget key prefix. Global namespace for keys. Write a script to flush the cache on every deploy. This actually seems just as good as, if not better than, the first idea: the cache is totally blown in both scenarios, and this one avoids the overhead of the runtime entity.
Any thoughts, different ideas greatly appreciated!
The os.environ['CURRENT_VERSION_ID'] value will be different for your two versions, so you will have separate caches for each one (the live one, and the dev/testing one).
So, I assume your problem is that when you "deploy" a version, you do not want the cache from development/testing to be used? (otherwise, like Nick and systempuntoout, I'm confused).
One way of achieving this would be to use the domain/host header in the cache - since this is different for your dev/live versions. You can extract the host by doing something like this:
import urlparse

scheme, netloc, path, query, fragment = urlparse.urlsplit(self.request.url)
# Discard any port number from the hostname
domain = netloc.split(':', 1)[0]
This won't give particularly nice keys, but it'll probably do what you want (assuming I understood correctly).
There was a bit of confusion with how I worded the question.
I ended up going for a per-class hash of attributes. Take this class for example:
import hashlib

from google.appengine.ext import db


class CachedModel(db.Model):
    @classmethod
    def cacheVersion(cls):
        # Compute the hash once per class and memoize it on the class itself.
        # (Check cls.__dict__ so each subclass gets its own version instead
        # of inheriting the parent's.)
        if '_cacheVersion' not in cls.__dict__:
            props = cls.properties()
            prop_keys = sorted(props.keys())
            fn = lambda p: '%s:%s' % (p, str(props[p].model_class))
            string = ','.join(map(fn, prop_keys))
            cls._cacheVersion = hashlib.md5(string).hexdigest()[0:10]
        return cls._cacheVersion

    @classmethod
    def cacheKey(cls, key):
        return '%s-%s' % (cls.cacheVersion(), str(key))
That way, when entities are saved to memcached using their cacheKey(...), they will share the cache only if the actual class is the same.
This also has the added benefit that pushing an update that does not modify a model, leaves all cache entries for that model intact. In other words, pushing an update no longer acts as flushing the cache.
This has the disadvantage of hashing the class once per instance of the webapp.
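For illustration, typical usage with the standard memcache module then looks something like this (a sketch; get_cached is a made-up helper, not part of the class above):
from google.appengine.api import memcache

def get_cached(key):
    # Look up under the class-versioned key; fall back to the datastore and
    # repopulate the cache on a miss.
    entity = memcache.get(CachedModel.cacheKey(key))
    if entity is None:
        entity = CachedModel.get(key)
        memcache.set(CachedModel.cacheKey(key), entity)
    return entity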
UPDATE on 2011-3-9: I changed to a more involved but more accurate way of getting the version. It turns out that using __dict__ yielded incorrect results, as its str representation includes pointer addresses. The new approach only considers the datastore properties.
UPDATE on 2011-3-14: Python's hash(...) is apparently not guaranteed to be equal between runs of the interpreter; I was getting weird cases where different App Engine instances saw different hashes. I am using md5 (which is faster than sha1, which is faster than sha256) for now; there is no real need for it to be crypto-secure, just an OK hash function. I will probably switch to something faster, but for now I'd rather be bug-free. I also made sure the property keys are sorted, not the property objects.
I'm running into a strange issue on Vista with the Performance monitoring API. I'm currently using code that worked fine on XP/2k, based around PdhGetFormattedCounterValue(). I start out using PdhExpandWildCardPath to expand the counters (I'm interested in overall network statistics), the counters I'm looking at are:
\\Network Interface(*)\\Bytes Received/sec
\\Network Interface(*)\\Bytes Sent/sec
\\Processor(_Total)\\% Processor Time
The problem is that on their first call they return PDH_INVALID_DATA. I don't think this is a problem in itself, since if I query again I start getting data without the error. The real problem is this: while the processor time counter works exactly as expected, neither of the network interface counters is returning anything, just 0 all the time. I verified using Perfmon that they are reporting data normally, so I'm at a loss as to what might be the issue. I caught this at MS:
http://support.microsoft.com/?scid=kb%3Ben-us%3B287159&x=11&y=9
But I'm not interested in multi-language for my task, so I don't think this is relevant. I will see if I can come up with some basic code showing exactly what I'm doing, but nothing is returning anything strange, and it worked on XP/2k, so I suspect something changed under the hood. Thanks!
It turns out the issue was that the network interface paths are both wildcards, whereas the Processor one is already rolled up by the performance monitoring. What I didn't realize was that PdhExpandWildCardPath doesn't return something directly usable by PdhAddCounter. By this I mean that if ExpandWildCard returns 3 expanded matches, they come back as null-separated strings. I understood this, but I had assumed that AddCounter would effectively create a counter containing all three. Nope; in reality I needed to break up each path, request each one individually from AddCounter, and then roll up the results manually when I get them.
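In code, the shape of the fix is roughly the following (a sketch with error handling omitted; the important part is walking the null-separated list returned by PdhExpandWildCardPath and adding one counter per expanded path):
#include <windows.h>
#include <pdh.h>
#include <vector>

#pragma comment(lib, "pdh.lib")

// Sketch: sum "Bytes Received/sec" across all network interfaces.
double SampleBytesReceivedPerSec()
{
    PDH_HQUERY query;
    PdhOpenQuery(NULL, 0, &query);

    // Ask for the required buffer size, then expand the wildcard path.
    DWORD len = 0;
    PdhExpandWildCardPath(NULL, TEXT("\\Network Interface(*)\\Bytes Received/sec"),
                          NULL, &len, 0);
    std::vector<TCHAR> paths(len);
    PdhExpandWildCardPath(NULL, TEXT("\\Network Interface(*)\\Bytes Received/sec"),
                          &paths[0], &len, 0);

    // The expansion is a null-separated list with a double null at the end:
    // add one counter per expanded path.
    std::vector<PDH_HCOUNTER> counters;
    for (TCHAR *p = &paths[0]; *p != TEXT('\0'); p += lstrlen(p) + 1) {
        PDH_HCOUNTER counter;
        PdhAddCounter(query, p, 0, &counter);
        counters.push_back(counter);
    }

    // Rate counters need two samples before they return valid data.
    PdhCollectQueryData(query);
    Sleep(1000);
    PdhCollectQueryData(query);

    // Roll up the per-interface values manually.
    double total = 0.0;
    for (size_t i = 0; i < counters.size(); ++i) {
        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counters[i], PDH_FMT_DOUBLE,
                                        NULL, &value) == ERROR_SUCCESS)
            total += value.doubleValue;
    }

    PdhCloseQuery(query);
    return total;
}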
Hopefully this helps someone else to avoid the same mistake I made with less frustration. ;)
In many embedded applications there is a tradeoff between making the code very efficient or isolating the code from the specific system configuration to be immune to changing requirements.
What kinds of C constructs do you usually employ to achieve the best of both worlds (flexibility and reconfigurability without losing efficiency)?
If you have the time, please read on to see exactly what I am talking about.
When I was developing embedded SW for airbag controllers, we had the problem that we had to change some parts of the code every time the customer changed their mind regarding the specific requirements. For example, the combination of conditions and events that would trigger the airbag deployment changed every couple weeks during development. We hated to change that piece of code so often.
At that time, I attended the Embedded Systems Conference and heard a brilliant presentation by Stephen Mellor called "Coping with changing requirements". You can read the paper here (they make you sign up, but it's free).
The main idea of this was to implement the core behavior in your code but configure the specific details in the form of data. The data is something you can change easily and it can even be programmable in EEPROM or a different section of flash.
This idea sounded great to solve our problem. I shared this with my colleague and we immediately started reworking some of the SW modules.
When trying to use this idea in our coding, we encountered some difficulty in the actual implementation. Our code constructs got terribly heavy and complex for a constrained embedded system.
To illustrate this I will elaborate on the example I mentioned above. Instead of having a bunch of if-statements to decide whether the combination of inputs was in a state that required an airbag deployment, we changed to a big table of tables. Some of the conditions were not trivial, so we used a lot of function pointers to be able to call lots of little helper functions that somehow resolved some of the conditions. We had several levels of indirection and everything became hard to understand. To make a long story short, we ended up using a lot of memory, runtime, and code complexity. Debugging the thing was not straightforward either. The boss made us change some things back because the modules were getting too heavy (and maybe he was right!).
PS: There is a similar question on SO, but it looks like the focus is different: Adapting to meet changing business requirements?
As another point of view on changing requirements: requirements go into building the code, so why not take a meta-approach to this:
Separate out parts of the program that are likely to change
Create a script that will glue parts of source together
This way you are maintaining compatible logic-building blocks in C ... and then sticking those compatible parts together at the end:
/* {conditions_for_airbag_placeholder} */
if (require_deployment)
    trigger_gas_release();
Then maintain independent conditions:
/* VAG condition */
if (poll_vag_collision_event())
    require_deployment = 1;
and another
/* Ford conditions */
if (ford_interrupt(FRONT_NEARSIDE_COLLISION))
    require_deployment = 1;
Your build script could look like:
BUILD airbag_deployment_logic.c WITH vag_events
TEST airbag_deployment_blob WITH vag_event_emitter
Thinking out loud, really. This way you get a tight binary blob without reading in config.
This is sort of like using overlays http://en.wikipedia.org/wiki/Overlay_(programming) but doing it at compile-time.
Our system is subdivided into many components, with exposed configuration and test points. There is a configuration file that is read at start-up that actually helps us instantiate components, attach them to each other, and configure their behavior.
It's very OO-like, in C, with the occasional hack to implement something like inheritance.
In the defense/avionics world software upgrades are very strictly controlled, and you can't just upgrade SW to fix issues... however, for some bizarre reason you can update a configuration file without a major fight. So it's been darn useful for us to be able to specify a lot of our implementation in those configuration files.
There is no magic, just good separation of concerns when designing the system and a bit of foresight on the part of the developers.
What are you trying to save exactly? Effort of code re-work? The red tape of a software version release?
It's possible that changing the code is reasonably straight-forward, and quite possibly easier than changing data in tables. Moving your often-changing logic from code to data is only helpful if, for some reason, it's less effort to modify data rather than code. That might be true if the changes are better expressed in a data form (e.g. numeric parameters stored in EEPROM). Or it might be true if the customer's requests make it necessary to release a new version of software, and a new software version is a costly procedure to build (lots of paperwork, or perhaps OTP chips burned by the chip maker).
Modularity is a very good principle for this sort of thing. It sounds as though you're already doing it to some degree. It's good to aim to isolate the often-changing code to as small an area as possible, and to keep the rest of the code ("helper" functions) separate (modular) and as stable as possible.
I don't make the code immune to requirements changes per se, but I always tag a section of code that implements a requirement by putting a unique string in a comment. With the requirements tags in place, I can easily search for that code when the requirement needs a change. This practice also satisfies a CMMI process.
For example, in the requirements document:
The following is a list of requirements related to the RST:
[RST001] Juliet SHALL start the RST with a 5 minute delay when the ignition is turned OFF.
And in the code:
/* Delay for RST when ignition is turned off [RST001] */
#define IGN_OFF_RST_DELAY 5
...snip...
/* Start RST with designated delay [RST001] */
if (IS_ROMEO_ON())
{
rst_set_timer(IGN_OFF_RST_DELAY);
}
I suppose what you could do is to specify several valid behaviors based on a byte or word of data that you could fetch from EEPROM or an I/O port if necessary and then create generic code to handle all possible events described by those bytes.
For instance, if you had a byte that specified the requirements for releasing the airbag it could be something like:
Bit 0: Rear collision
Bit 1: Speed above 55mph (bonus points for generalizing the speed value!)
Bit 2: passenger in car
...
Etc
Then you pull in another byte that says what events happened and compare the two. If they're the same, execute your command, if not, don't.
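A minimal sketch of that comparison in C might look like this (bit names are illustrative; here the stored byte is treated as the set of required events):
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit assignments for the configuration byte. */
#define EVT_REAR_COLLISION    (1u << 0)
#define EVT_SPEED_OVER_55     (1u << 1)
#define EVT_PASSENGER_IN_CAR  (1u << 2)

/* Required-event mask, fetched from EEPROM or an I/O port at start-up. */
static uint8_t deployment_mask;

bool should_deploy(uint8_t current_events)
{
    /* Deploy only when every event required by the configuration byte is
     * present in the current event byte. */
    return (current_events & deployment_mask) == deployment_mask;
}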
For adapting to changing requirements I would concentrate on making the code modular and easy to change, e.g. by using macros or inline functions for parameters which are likely to change.
W.r.t. a configuration which can be changed independently from the code, I would hope that the parameters which are reconfigurable are specified in the requirements, too. Especially for safety-critical stuff like airbag controllers.
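In the simplest cases that can be as small as keeping the tunable values together in one header, so a requirements change touches a single line (the names and values below are made up):
/* airbag_params.h -- customer-tunable parameters in one place. */
#define DEPLOY_DECEL_THRESHOLD_G   40u   /* deceleration threshold, in g   */
#define DEPLOY_MIN_SPEED_KPH       30u   /* no deployment below this speed */
#define SEAT_OCCUPIED_MIN_KG       25u   /* passenger-detection threshold  */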
Hooking in a dynamic language can be a lifesaver, if you've got the memory and processor power for it.
Have the C talk to the hardware, and then pass up a known set of events to a language like Lua. Have the Lua script parse the event and callback to the appropriate C function(s).
Once you've got your C code running well, you won't have to touch it again unless the hardware changes. All of the business logic becomes part of the script, which in my opinion is a lot easier to create, modify and maintain.
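A bare-bones sketch of that split, using the standard Lua C API (the function and file names here are made up for illustration):
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>

extern void trigger_gas_release(void);   /* hardware-facing C code */

/* C function exposed to the script as deploy_airbag(). */
static int l_deploy_airbag(lua_State *L)
{
    (void)L;
    trigger_gas_release();
    return 0;   /* no return values to Lua */
}

/* Load the business logic from a script and register the C callbacks. */
lua_State *logic_init(void)
{
    lua_State *L = luaL_newstate();
    luaL_openlibs(L);
    lua_register(L, "deploy_airbag", l_deploy_airbag);
    luaL_dofile(L, "logic.lua");          /* defines on_event(id) in Lua */
    return L;
}

/* Called from the C event loop: hand the event to the script, which decides
 * whether to call back into deploy_airbag(). */
void logic_handle_event(lua_State *L, int event_id)
{
    lua_getglobal(L, "on_event");
    lua_pushinteger(L, event_id);
    lua_pcall(L, 1, 0, 0);
}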