For the last few weeks, I have been trying to deal with an intermittent problem on a camel route using XSLT processing following aggregation. It is intermittent in the sence that while it frequently raises this exception, I can re-run the data extract and processing that failed a few seconds later and it usually succeeds. I have yet to find any data that fails consistently.
I am assuming that the aggregation is causing the problem, but I can't for the life of me understand why. I thought it might be the custom aggregation bean I was using, so I replaced it with XSLTAggreationStrategy, but it still intermittently gives this issue, either when further transforming the aggregated XML, or when just writing it out to the a file.
This is executing in an Apache-Karaf environment, and I have Camel-Saxon 2.21.2 and Apache ServiceMix Saxon-HE 9.8.0.8_1 bundles loaded.
Thanks for looking.
The abridged stack trace is:
...
Caused by: [java.lang.NullPointerException -
null]java.lang.NullPointerException
at net.sf.saxon.dom.DOMNodeWrapper$ChildEnumeration.skipFollowingTextNodes(DOMNodeWrapper.java:1149)
at net.sf.saxon.dom.DOMNodeWrapper$ChildEnumeration.next(DOMNodeWrapper.java:1178)
at net.sf.saxon.tree.util.Navigator$EmptyTextFilter.next(Navigator.java:1078)
at net.sf.saxon.tree.util.Navigator$AxisFilter.next(Navigator.java:1039)
at net.sf.saxon.tree.util.Navigator$AxisFilter.next(Navigator.java:1017)
at net.sf.saxon.expr.parser.ExpressionTool.effectiveBooleanValue(ExpressionTool.java:643)
at net.sf.saxon.expr.Expression.effectiveBooleanValue(Expression.java:532)
at net.sf.saxon.pattern.PatternWithPredicate.matches(PatternWithPredicate.java:141)
at net.sf.saxon.trans.Mode.searchRuleChain(Mode.java:570)
at net.sf.saxon.trans.Mode.getRule(Mode.java:476)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1041)
at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:281)
at net.sf.saxon.expr.instruct.ApplyTemplates.processLeavingTail(ApplyTemplates.java:241)
at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1057)
at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:281)
at net.sf.saxon.expr.instruct.ApplyTemplates.process(ApplyTemplates.java:237)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:431)
at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:373)
at net.sf.saxon.expr.instruct.Template.applyLeavingTail(Template.java:239)
at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:1057)
at net.sf.saxon.Controller.transformDocument(Controller.java:2080)
at net.sf.saxon.Controller.transform(Controller.java:1903)
at org.apache.camel.builder.xml.XsltBuilder.process(XsltBuilder.java:141)
at org.apache.camel.impl.ProcessorEndpoint.onExchange(ProcessorEndpoint.java:103)
at org.apache.camel.component.xslt.XsltEndpoint.onExchange(XsltEndpoint.java:138) ...
In 9.8.0.8, the class net.sf.saxon.dom.DOMNodeWrapper has only 1144 lines, so a stacktrace showing line 1178 suggests there's some kind of versioning problem.
The class DOMNodeWrapper was first introduced in 9.5 (previously it was called NodeWrapper), and the line numbers are just one off from those in the current 9.5 source, so I suspect what you have loaded is some sub-release of the 9.5 branch. Other line numbers in the stack trace are also consistent with this being 9.5.
That of course doesn't explain the problem, but it might give a clue.
My immediate instinct was that over the years since 9.5 we might have fixed a multi-threading bug. DOM is not thread-safe, so Saxon takes considerable care to synchronize its access. Saxon bug https://saxonica.plan.io/issues/2376 addresses this problem. On the 9.5 branch this was first fixed in maintenance release 9.5.1.11, so it's possible you don't have that patch. I think it would be useful to investigate why you are loading an old version of Saxon, and another useful angle would be to discover exactly which version it is (the static method net.sf.saxon.Version.getProductVersion() will give you this information.)
Incidentally, if you are using multi-threaded access to a DOM tree then you should ask yourself whether this is a good idea. Saxon access to DOM is slow at the best of times (compared to JDOM and XOM, let alone to Saxon's native tree model), and the lack of thread safety and the need for synchronisation makes it a pretty poor choice in a multi-threaded application.
Also, note that Saxon can synchronize its own access to the DOM, but it can't synchronize with third-party code that might also be using the DOM.
Related
I am trying to de-duplicate events in my Flink pipeline. I am trying to do that using guava cache.
My requirement is that, I want to de-duplicate over a 1 minute window. But at any given point I want to maintain not more than 10000 elements in the cache.
A small background on my experiment with Flink windowing:
Tumbling Windows: I was able to implement this using Tumbling windows + custom trigger. But the problem is, if an element occurs in the 59th minute and 61st minute, it is not recognized as a duplicate.
Sliding Windows: I also tried sliding window with 10 second overlap + custom trigger. But an element that came in the 55th second is part of 5 different windows and it is written to the sink 5 times.
Please let me know if I should not be seeing the above behavior with windowing.
Back to Guava:
I have Event which looks like this and a EventsWrapper for these events which looks like this. I will be getting a stream of EventsWrappers. I should remove duplicate Events across different EventsWrappers.
Example if I have 2 EventsWrappers like below:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1',
name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e1',
name='event1'}, Event{id='e3', name='event3'}]}
I should emit as output the following:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1',
name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e3', name='event3'}]}
i.e Making sure that e1 event is emitted only once assuming these two events are within the time and size requirements of the cache.
I created a RichFlatmap function where I initiate a guava cache and value state like this. And set the Guava cache in the value state like this. My overall pipeline looks like this.
But each time I try to update the guava cache inside the value state:
eventsState.value().put(eventId, true);
I get the following error:
java.lang.NullPointerException
at com.google.common.cache.LocalCache.hash(LocalCache.java:1696)
at com.google.common.cache.LocalCache.put(LocalCache.java:4180)
at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4888)
at events.piepline.DeduplicatingFlatmap.lambda$flatMap$0(DeduplicatingFlatmap.java:59)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:176)
On further digging, I found out that the error is because the keyEquivalence inside the Guava cache is null.
I checked by directly setting on the Guava cache(not through state, but directly on the cache) and that works fine.
I felt this could be because, ValueState is not able to serialize GuavaCache. So I added a Serializer like this and registered it like this:
env.registerTypeWithKryoSerializer((Class<Cache<String,Boolean>>)(Class<?>)Cache.class, CacheSerializer.class);
But this din't help either.
I have the following questions:
Any idea what I might be doing wrong with the Guava cache in the above case.
Is what I am seeing with my Tumbling and Slinding windows implementation is what is expected or am I doing something wrong?
What will happen if I don't set the Guava Cache in ValueState, instead just use it as a plain object in the DeduplicatingFlatmap class and operate directly on the Guava Cache instead of operating through the ValueState? My understanding is, the Guava cache won't be part of the Checkpoint. So when the pipeline fails and restarts, the GuavaCahe would have lost all the values in it and it will be empty on restart. Is this understanding correct?
Thanks a lot in advance for the help.
See below.
These windows are behaving as expected.
Your understanding is correct.
Even if you do get it working, using a Guava cache as ValueState will perform very poorly, because RocksDB is going to deserialize the entire cache on every access, and re-serialize it on every update.
Moreover, it looks like you are trying to share a single cache instance across all of the orgs that happen to be multiplexed across a single flatmap instance. That's not going to work, because the RocksDB state backend will make a copy of the cache for each org (a side effect of the serialization involved).
Your requirements aren't entirely clear, but a deduplication query might help. But I'm thinking MapState in combination with timers in a KeyedProcessFunction is more likely to be the building block you need. Here's an example that might help you get started (but you'll be wanting to handle the timers differently).
I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires me wrapping every piece of logic with something to catch every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
\=> dead letter queue
As soon as your record passes your validate operator, you want all errors to bubble up, as any error in these operators may result in corrupted aggregates and data that - once written - cannot be reverted easily.
The validate step would work with any of the two approaches that you outlined. Typically, side-outputs have better semantics, but you may end up with more code.
Now you may have a service with high SLAs and actually want it to produce output even if it is corrupted just to produce data. Or you have simple transformation pipeline, where you'd miss some events but keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap the code of all operators with try-catch. However, you'd typically still would only do it for the fragile operators and not for all of them. Trivial operators should be tested and then trusted to work. Further, you'd usually only catch specific kinds of exceptions to limit the scope to the kind of expected exceptions that can happen.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let the developers explicitly reason about exceptions and exception handling. It's also not straight-forward to see what the requirements are: Do you want to have the input only? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is to extract most of the complicated logic things into a single ProcessFunction and use side-outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's done multiple times, you could extract a helper function where you pass your actual code as a RunnableWithException lambda which hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
I'd also add quite a few IT cases and use mutation testing to harden your pipeline quicker. If you keep your test data inline, the mutants may also exactly simulate your unexpected data issues, such that your validate operator gets more complete.
I am trying to upgrade from jconn2 to jconn4. The issue that i am facing is that c3p0 is notworking as expected. Quick online search says that it supports jconn4 completely, but i get the below exception.
com.sybase.jdbc4.jdbc.SybSQLException: SQL Anywhere Error -685: Resource governor for 'prepared statements' exceeded
at com.sybase.jdbc4.tds.Tds.processEed(Tds.java:4003)
at com.sybase.jdbc4.tds.Tds.nextResult(Tds.java:3093)
at com.sybase.jdbc4.jdbc.ResultGetter.nextResult(ResultGetter.java:78)
at com.sybase.jdbc4.jdbc.SybStatement.nextResult(SybStatement.java:289)
at com.sybase.jdbc4.jdbc.SybStatement.nextResult(SybStatement.java:271)
at com.sybase.jdbc4.jdbc.SybStatement.queryLoop(SybStatement.java:2408)
at com.sybase.jdbc4.jdbc.SybStatement.executeQuery(SybStatement.java:2394)
at com.sybase.jdbc4.jdbc.SybPreparedStatement.executeQuery(SybPreparedStatement.java:257)
Any suggestions on how to tackle this issue??/
It looks like your issue is too many prepared statements open, relative to max_statement_count defined on your server.
The simplest thing to do would just be to turn Statement caching off in c3p0, i.e. make sure that the c3p0 properties maxStatements and maxStatementsPerConnection are set to 0. If you want the performance benefits of statement caching, make sure that maxStatements is set to a value significantly lower than the server-side max_statement_count. You could also turn the "resource governor" off by setting max_statement_count to zero, although Sybase seems to discourage that.
See also c3p0 docs re Statement caching.
I'm running into a strange issue on Vista with the Performance monitoring API. I'm currently using code that worked fine on XP/2k, based around PdhGetFormattedCounterValue(). I start out using PdhExpandWildCardPath to expand the counters (I'm interested in overall network statistics), the counters I'm looking at are:
\\Network Interface(*)\\Bytes Received/sec
\\Network Interface(*)\\Bytes Sent/sec
\\Processor(_Total)\\% Processor Time
The problem is that on their first call they return PDH_INVALID_DATA, I don't think this is a problem, since if I query it again I will start getting data without the error. The problem is this - while the processor time is worked exactly as expected, neither of the network interface counters are returning anything - just 0 all the time. I verified using Perfmon that they are reporting data normally, so I'm at a loss as to what might be the issue. I caught this at MS:
http://support.microsoft.com/?scid=kb%3Ben-us%3B287159&x=11&y=9
But I'm not interested in multi-language for my task, so I don't think this is relevant. I will see if I can come up with some basic code showing exactly what I'm doing, but nothing is returning anything strange, and it worked on XP/2k, so I suspect something changed under the hood. Thanks!
It turns out the issue was that the network interfaces are both wildcards, whereas the Processor one is actually already rolled up by the performance monitoring. What I didn't realize was that it PdhExpandWildCardPath didn't return something directly usable by PdhAddCounter. By this I mean that if ExpandWildCard returns 3 expanded matches, they come back as a null separated strings - I understood this, but I had assumed that AddCounter would be effectively create a counter containing all three. Nope, reality is I needed to break up each path and request it individually from AddCounter, then roll up the results manually when I get them.
Hopefully this helps someone else to avoid the same mistake I made with less frustration. ;)
When you take your first look at an Oracle database, one of the first questions is often "where's the alert log?". Grid Control can tell you, but its often not available in the environment.
I posted some bash and Perl scripts to find and tail the alert log on my blog some time back, and I'm surprised to see that post still getting lots of hits.
The technique used is to lookup background_dump_dest from v$parameter. But I only tested this on Oracle Database 10g.
Is there a better approach than this? And does anyone know if this still works in 11g?
Am sure it will work in 11g, that parameter has been around for a long time.
Seems like the correct way to find it to me.
If the background_dump_dest parameter isn't set, the alert.log will be put in $ORACLE_HOME/RDBMS/trace
Once you've got the log open, I would consider using File::Tail or File::Tail::App to display it as it's being written, rather than sleeping and reading. File::Tail::App is particularly clever, because it will detect the file being rotated and switch, and will remember where you were up to between invocations of your program.
I'd also consider locking your cache file before using it. The race condition may not bother you, but having multiple people try to start your program at once could result in nasty fights over who gets to write to the cache file.
However both of these are nit-picks. My brief glance over your code doesn't reveal any glaring mistakes.