Crash while calling Aws::Iot::MqttClientConnectionConfigBuilder.build() - aws-iot

I am running into this problem when I try to create an MQTT connection using a key, a certificate, and an endpoint obtained from provisioning in a previous step. When I call Aws::Iot::MqttClientConnectionConfigBuilder.build() to create the connection config, the binary crashes with:
free(): invalid pointer
[INFO] [2022-07-13T08:25:48Z] [00007f7c560cc800] [event-loop] - id=0x555d59343da0: Initializing edge-triggered epoll
[INFO] [2022-07-13T08:25:48Z] [00007f7c560cc800] [event-loop] - id=0x555d59343da0: Using eventfd for cross-thread notifications.
[TRACE] [2022-07-13T08:25:48Z] [00007f7c560cc800] [event-loop] - id=0x555d59343da0: eventfd descriptor 5.
[INFO] [2022-07-13T08:25:48Z] [00007f7c560cc800] [event-loop] - id=0x555d59343da0: Starting event-loop thread.
[INFO] [2022-07-13T08:25:48Z] [00007f7c560cc800] [dns] - id=0x555d593642a0: Initializing default host resolver with 1 max host entries.
[INFO] [2022-07-13T08:25:48Z] [00007f7c560cc800] [channel-bootstrap] - id=0x555d59366550: Initializing client bootstrap with event-loop group 0x555d59365820
[DEBUG] [2022-07-13T08:25:48Z] [00007f7c560cc800] [mqtt-client] - client=0x555d593665c0: Initalizing MQTT client
[DEBUG] [2022-07-13T08:25:48Z] [00007f7c560cc800] [tls-handler] - ctx: Certificate and key have been set, setting them up now.
[INFO] [2022-07-13T08:25:48Z] [00007f7c558c8640] [event-loop] - id=0x555d59343da0: main loop started
[TRACE] [2022-07-13T08:25:48Z] [00007f7c558c8640] [event-loop] - id=0x555d59343da0: subscribing to events on fd 5
[INFO] [2022-07-13T08:25:48Z] [00007f7c558c8640] [event-loop] - id=0x555d59343da0: default timeout 100000, and max events to process per tick 100
[TRACE] [2022-07-13T08:25:48Z] [00007f7c558c8640] [event-loop] - id=0x555d59343da0: waiting for a maximum of 100000 ms
Aborted (core dumped)
After some debugging, I found that the binary crashes at:
https://github.com/aws/s2n-tls/blob/8314a96de0c33a426ae877856a8a1a431d354e0d/crypto/s2n_certificate.c#L310
I did not really understand why. Moreover, I saw what looks like a possible double free in:
https://github.com/aws/s2n-tls/blob/8314a96de0c33a426ae877856a8a1a431d354e0d/crypto/s2n_certificate.c#L317
which could (maybe) cause a crash. :)
I am compiling the SDK for the x86 architecture with the following CMake flags: -DOPENSSL_NO_ASM=TRUE -DBUILD_SHARED_LIBS=ON
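For reference, the configure/build step looks roughly like this (the repository path, build directory, and install prefix are placeholders; only the two flags above are from my actual build):

```shell
cmake -S aws-iot-device-sdk-cpp-v2 -B build \
      -DOPENSSL_NO_ASM=TRUE \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_INSTALL_PREFIX=/path/to/install
cmake --build build --target install
```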
Maybe I am missing something. Any help on how to solve this would be appreciated.
Thank you.

Related

Telegraf: [inputs.sqlserver] Error in plugin read: connection reset by peer

I am using the SQL Server plugin with the telegraf helm chart in the AKS cluster in order to monitor SQL servers that are on premise. My values file is as follows:
## Exposed telegraf configuration
## For full list of possible values see `/docs/all-config-values.yaml` and `/docs/all-config-values.toml`
## ref: https://docs.influxdata.com/telegraf/v1.1/administration/configuration/
config:
  outputs:
    - health:
        service_address: "http://:8888"
    - influxdb:
        urls:
          - "http://monitoring-influxdb.monitoring.svc.cluster.local:8086"
        database: "telegraf"
        username: admin
        password: admin
  inputs:
    - sqlserver:
        servers:
          - "Server=XX.XX.XX.XX;Port=1433;User Id=sql_telegraf;Password=XXXXXXXX;app name=telegraf;log=1;"
However, I seem to be getting the following error all the time:
2020-02-20T04:22:26Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-02-20T04:22:36Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-02-20T04:22:36Z I! ERROR: Intercepted panic read tcp YY.YY.YY.YY:45556->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:22:36Z E! [inputs.sqlserver] Error in plugin: read tcp YY.YY.YY.YY:45556->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:22:46Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-02-20T04:22:56Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-02-20T04:22:57Z I! ERROR: Intercepted panic read tcp YY.YY.YY.YY:45980->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:22:57Z E! [inputs.sqlserver] Error in plugin: read tcp YY.YY.YY.YY:45980->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:23:01Z I! ERROR: BeginRead failed read tcp YY.YY.YY.YY:45380->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:23:01Z E! [inputs.sqlserver] Error in plugin: read tcp YY.YY.YY.YY:45380->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:23:06Z W! [agent] [inputs.sqlserver] did not complete within its interval
2020-02-20T04:23:08Z I! ERROR: Intercepted panic read tcp YY.YY.YY.YY:45374->XX.XX.XX.XX:1433: read: connection reset by peer
2020-02-20T04:23:08Z E! [inputs.sqlserver] Error in plugin: read tcp YY.YY.YY.YY:45374->XX.XX.XX.XX:1433: read: connection reset by peer

I am getting an error on mvn verify site

It gives me this error, pointing to one of the pom files. I checked the plugin and it's not null. I even updated it to the latest version; it still doesn't fix it.
Execution default-site of goal org.apache.maven.plugins:maven-site-plugin:3.4:site failed: Anchor name cannot be null
It looks like that exception surfaces in a lot of cases and is incredibly unhelpful.
There was an issue a while back where not having a license name defined in the pom would throw this exception.
I ran into this exception (with version 3.4) where I had a typo in my changes.xml file (verison instead of version). I was only tipped off by seeing that it failed in a ChangesReportGenerator class. Even then I glanced over the typo several times before seeing it.
For anyone else running into this, take the time to check for even the most random typos - it might just be the problem.
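To illustrate, here is a hypothetical src/changes/changes.xml with that kind of typo; the misspelled attribute means the release effectively has no version, and the report generator later derives a null anchor name from it:

```xml
<document xmlns="http://maven.apache.org/changes/1.0.0">
  <body>
    <!-- "verison" should be "version": the typo silently drops the attribute -->
    <release verison="1.0.0" date="2016-09-02" description="Example release">
      <action dev="jdoe" type="fix">Example fix entry.</action>
    </release>
  </body>
</document>
```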
For reference, here's the complete error message from the log:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] example.products.parent ...................... SUCCESS [0.709s]
[INFO] Example Product .............................. FAILURE [2.199s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.162s
[INFO] Finished at: Fri Sep 02 12:53:19 CDT 2016
[INFO] Final Memory: 26M/64M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.4:site (default-site) on project example.product: Execution default-site of goal org.apache.maven.plugins:maven-site-plugin:3.4:site failed: Anchor name cannot be null! -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-site-plugin:3.4:site (default-site) on project example.product: Execution default-site of goal org.apache.maven.
plugins:maven-site-plugin:3.4:site failed: Anchor name cannot be null!
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution default-site of goal org.apache.maven.plugins:maven-site-plugin:3.4:site failed: Anchor name cannot be null!
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
... 19 more
Caused by: java.lang.NullPointerException: Anchor name cannot be null!
at org.apache.maven.doxia.sink.XhtmlBaseSink.anchor(XhtmlBaseSink.java:1545)
at org.apache.maven.doxia.siterenderer.sink.SiteRendererSink.anchor(SiteRendererSink.java:253)
at org.apache.maven.doxia.sink.XhtmlBaseSink.anchor(XhtmlBaseSink.java:1533)
at org.apache.maven.plugin.issues.AbstractIssuesReportGenerator.sinkSectionTitle2Anchor(AbstractIssuesReportGenerator.java:181)
at org.apache.maven.plugin.changes.ChangesReportGenerator.constructRelease(ChangesReportGenerator.java:528)
at org.apache.maven.plugin.changes.ChangesReportGenerator.constructReleases(ChangesReportGenerator.java:511)
at org.apache.maven.plugin.changes.ChangesReportGenerator.doGenerateReport(ChangesReportGenerator.java:230)
at org.apache.maven.plugin.changes.ChangesMojo.executeReport(ChangesMojo.java:356)
at org.apache.maven.reporting.AbstractMavenReport.generate(AbstractMavenReport.java:196)
at org.apache.maven.plugins.site.render.ReportDocumentRenderer.renderDocument(ReportDocumentRenderer.java:224)
at org.apache.maven.doxia.siterenderer.DefaultSiteRenderer.renderModule(DefaultSiteRenderer.java:311)
at org.apache.maven.doxia.siterenderer.DefaultSiteRenderer.render(DefaultSiteRenderer.java:129)
at org.apache.maven.plugins.site.render.SiteMojo.renderLocale(SiteMojo.java:182)
at org.apache.maven.plugins.site.render.SiteMojo.execute(SiteMojo.java:141)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
... 20 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command

Cassandra 2.0.7 to 2.1.2 sstable upgradesstables, compaction problems

We upgraded Cassandra (5+5 nodes) from 2.0.9 to 2.1.2 (binaries) and ran nodetool upgradesstables node by node (bash script). After this we observe some problems:
On every node we observe about 50 pending compaction tasks (on one of them, more than 500), and this has persisted for the 5 days since we started nodetool upgradesstables. Even though concurrent_compactors is set to 8, Cassandra never runs more than 3-4 compactions at the same time. The node with more than 500 pending tasks has about 11k files in its column family directory...
We have 2 SSD disks, but during compaction we see at most 10 MB/s of reads and 5 MB/s of writes, even with compaction_throughput_mb_per_sec set to 32, 64, or 256.
During upgradesstables, on some tables we got:
WARN [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 ColumnFamilyStore.java:2492 - Unable to cancel in-progress compactions for reco_active_items_v1. Perhaps there is an unusually large row in progress somewhere, or the system is simply overloaded.
INFO [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 CompactionManager.java:247 - Aborting operation on reco_prod.reco_active_items_v1 after failing to interrupt other compaction operations
nodetool is failing with:
Aborted upgrading sstables for atleast one column family in keyspace reco_prod, check server logs for more information.
On some nodes nodetool upgradesstables finished successfully, but we can still see jb files in the column family directory.
On some nodes, nodetool upgradesstables returns:
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewStart(SSTableReader.java:952)
at org.apache.cassandra.io.sstable.SSTableRewriter.moveStarts(SSTableRewriter.java:250)
at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:300)
at org.apache.cassandra.io.sstable.SSTableRewriter.abort(SSTableRewriter.java:186)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:204)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$4.execute(CompactionManager.java:340)
at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is our production environment (24/7), and we observe higher load on the nodes and higher read latency, sometimes more than 1 second.
Any advice?
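For anyone hitting similar symptoms: the compaction throttle and queue state can be inspected and adjusted at runtime with nodetool (commands as in Cassandra 2.1; the value below is only an example):

```shell
nodetool compactionstats            # pending tasks and currently running compactions
nodetool getcompactionthroughput    # the throttle actually in effect
nodetool setcompactionthroughput 0  # 0 = unthrottled; useful to test whether the throttle is the bottleneck
```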

Restart a mpi slave after checkpoint before failure on ARMv6

UPDATE
I have a university project in which I am supposed to build up a cluster with RPis.
Now we have a fully functional system with BLCR/MPICH on.
BLCR works very well with normal processes linked with the lib.
Demonstrations we have to show from our management web interface are:
parallel execution of a job
migration of processes across the nodes
fault tolerance with MPI
We are allowed to use the simplest computations.
The first one we got working easily, with MPI too. The second point we currently have working only with normal processes (without MPI). Regarding the third point, I have little idea how to implement a master-slave MPI scheme in which I can restart a slave process; this also affects point two, because we should/can/have to checkpoint the slave process, kill/stop it, and restart it on another node. I know that I have to handle the MPI errors myself, but how do I restore the process? It would be nice if someone could at least point me to a link or paper (with explanations).
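On the error-handling point: by default MPI aborts the entire job when one rank fails, so the master never gets a chance to react. A minimal sketch of the usual starting point, assuming MPICH and leaving the actual restart-from-checkpoint step as an application-specific placeholder:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, rc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Do not abort the job on failure; return the error code to the caller. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        int answer;
        MPI_Status st;
        /* Master: a receive involving a dead slave now yields an error code
           instead of killing the whole job. */
        rc = MPI_Recv(&answer, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &st);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "slave 1 failed (rc=%d); reassign its work\n", rc);
            /* ...restart the slave from its BLCR checkpoint here... */
        }
    } else {
        int answer = 42;
        MPI_Send(&answer, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```

Note that MPI_ERRORS_RETURN only keeps the surviving ranks alive long enough to react; it does not by itself make the communicator usable again after a failure.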
Thanks in advance
UPDATE:
As written earlier, our BLCR+MPICH setup works, or seems to. But...
When I start MPI processes, checkpointing seems to work well.
Here the proof:
... snip ...
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... DONE
Many salts: 767744 c/s real, 767744 c/s virtual
Only one salt: 560896 c/s real, 560896 c/s virtual
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... [proxy:0:0#node2] requesting checkpoint
[proxy:0:0#node2] checkpoint completed
[proxy:0:1#node1] requesting checkpoint
[proxy:0:1#node1] checkpoint completed
[proxy:0:2#node3] requesting checkpoint
[proxy:0:2#node3] checkpoint completed
... snip ...
If I kill one Slave-Process on any node I get this here:
... snip ...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
... snip ...
That is OK, because we have a checkpoint, so we can restart our application.
But the restart doesn't work:
pi 7380 0.0 0.2 2984 1012 pts/4 S+ 16:38 0:00 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
pi 7381 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.101 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 7382 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.102 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
pi 7383 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.105 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
pi 7438 0.0 0.1 3548 868 pts/1 S+ 16:40 0:00 grep --color=auto mpi
I don't know why, but the first time I restart the app, the process seems to be restarted on every node (I can see it with top or ps aux | grep "john"), yet no output is shown on the management console/terminal. It just hangs after showing me:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
My plan B is just to test with our own application whether the BLCR/MPICH stack really works. Maybe there are some problems with john.
Thanks in advance
UPDATE
Next problem, with a simple hello world. I'm slowly despairing; maybe I'm just too confused.
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 -f /tmp/machinefile -n 4 ./hello
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
[proxy:0:0#node2] requesting checkpoint
[proxy:0:0#node2] checkpoint completed
[proxy:0:1#node1] requesting checkpoint
[proxy:0:1#node1] checkpoint completed
[proxy:0:2#node3] requesting checkpoint
[proxy:0:2#node3] checkpoint completed
[proxy:0:0#node2] requesting checkpoint
[proxy:0:0#node2] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0#node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0#node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1#node1] requesting checkpoint
[proxy:0:1#node1] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:1#node1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:1#node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1#node1] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2#node3] requesting checkpoint
[proxy:0:2#node3] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:2#node3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:2#node3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2#node3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec#masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec#masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec#masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
hello.c
/* C Example */
#include <stdio.h>
#include <unistd.h>  /* gethostname, getpid */
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size, i, j;
  char hostname[1024];
  hostname[1023] = '\0';
  gethostname(hostname, 1023);
  MPI_Init (&argc, &argv);                /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
  /* busy loop so the job lives long enough to be checkpointed;
     j must be initialized, otherwise the inner loop is undefined behavior */
  for (i = 0; i < 400000000; i++) {
    for (j = 0; j < 4000000; j++) {
    }
  }
  printf("%s done...", hostname);
  printf("%s: %d is alive\n", hostname, (int) getpid());
  MPI_Finalize();
  return 0;
}

Apache Camel: handling unix file permission errors wrapped in GenericFileOperationFailedException

Here's the problem I've been grappling with for a while...I'm using Camel (v2.10.2) to set up many file routes to move data across file systems, servers, and in/out of the organisation (B2B). There are data and signal files in their respective dirs with some of the routes being short lived, while others run as services on different VMs/servers. These processes (routes) are run under different unix 'functional' ids, but there is an attempt to make them belong to the same unix group(s) if possible...
Of course on unix there is always the potential for file/dir permission problems...and that is the issue I'm facing/trying to solve.
I use the DefaultErrorHandler and log success or failure for an exchange via a custom RoutePolicy within the onExchangeDone(...) checking the Exchange.isFailed(). The signal file is either moved to the destination on success or moved to .error dir on fail, with an alert written to a system-wide alert log (checked by Tivoli)
The file route is configured to propagate errors occurring while picking up files, etc via the consumer.bridgeErrorHandler=true
Basically, if I have any unix permission related errors, then I want to stop (and maybe remove) the affected route, clearly indicating that this has happened and why. A permission issue is not easily solvable programmatically, so stop-and-alert is the only option.
So I'll illustrate a test case that causes an issue...
App_A creates some data files in ./data/. Then App_A creates the same number of signal files in ./signal/, but due to some 'data' related bug it also creates a signal file ./signal/acc_xyz.csv that doesn't have a corresponding data file.
Route starts to process ./signal/acc_xyz.csv and the 'validation process' finds that ./data/acc_xyz.csv doesn't exist and throws an exception to indicate this, hence stopping the exchange being processed further.
The File component is configured with moveFailed=.error to move the signal file to ./signal/.error/, but this dir is locked (don't worry why this is) to the functional user id executing the Java process and internal Camel processing throws a GenericFileOperationFailedException indicating the cause to be an underlying 'Permission denied' issue.
Oh dear, the same signal file is then processed again, and again, and...
I have tried to get this 'secondary error' propagated to my code, but have failed, hence I can't stop the route.
How can I get this and other internal Camel errors propagated to my code/exception handler/whatever and not just seeing it be logged and swallowed?
thanks in advance
OK, here is more detail from log4j... note the sequence of times.
Camel DefaultErrorHandler:
2013-04-25 15:06:26,001 [Camel (camel-1) thread #0 - file:///FTROOT/fileTransfer/outbound/signal] ERROR (MarkerIgnoringBase.java:161) - Failed delivery for (MessageId: ID-rwld601-rw-discoverfinancial-com-60264-1366902384246-0-1 on ExchangeId: ID-rwld601-rw-discoverfinancial-com-60264-1366902384246-0-2). Exhausted after delivery attempt: 1 caught: java.lang.IllegalStateException: missingFile: route [App_A.outboundReceipt] has missing file at /FTROOT/fileTransfer/outbound/data/stuff.log
java.lang.IllegalStateException: missingFile: route [App_A.outboundReceipt] has missing file at /FTROOT/fileTransfer/outbound/data/stuff.log
at com.myco.mft.process.BaseFileRouteBuilder.checkFile(BaseFileRouteBuilder.java:934)
My alert logger via the RoutePolicy.onExchangeDone(...) - at this point the exchange has completed with a failure:
2013-04-25 15:06:26,011|Camel (camel-1) thread #0 - file:///FTROOT/fileTransfer/outbound/signal|exchange|App_A.outboundReceipt|signalFile=/FTROOT/fileTransfer/outbound/signal/stuff.log|there has been a routing failure|missingFile: route [App_A.outboundReceipt] has missing file at /FTROOT/fileTransfer/outbound/data/stuff.log
Camel endpoint post-processing - this is the stuff that Camel doesn't propagate to me:
2013-04-25 15:06:26,027 [Camel (camel-1) thread #0 - file:///FTROOT/fileTransfer/outbound/signal] WARN (GenericFileOnCompletion.java:149) - Rollback file strategy: org.apache.camel.component.file.strategy.GenericFileDeleteProcessStrategy#104e28b for file: GenericFile[/FTROOT/fileTransfer/outbound/signal/stuff.log]
2013-04-25 15:06:28,038 [Camel (camel-1) thread #0 - file:///FTROOT/fileTransfer/outbound/signal] WARN (MarkerIgnoringBase.java:136) - Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Error renaming file from /FTROOT/fileTransfer/outbound/signal/stuff.log to /FTROOT/fileTransfer/outbound/signal/.error/stuff.log]
org.apache.camel.component.file.GenericFileOperationFailedException: Error renaming file from /FTROOT/fileTransfer/outbound/signal/stuff.log to /FTROOT/fileTransfer/outbound/signal/.error/stuff.log
at org.apache.camel.component.file.FileOperations.renameFile(FileOperations.java:72)
...
Caused by: java.io.FileNotFoundException: /FTROOT/fileTransfer/outbound/signal/stuff.log (Permission denied)
at java.io.FileInputStream.open(Native Method)
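For what it's worth, the stop-and-alert idea mentioned above can be sketched as a RoutePolicy (Camel 2.10-era API; the class name and logging are placeholders, and this does not by itself surface the swallowed GenericFileOperationFailedException):

```java
import org.apache.camel.Exchange;
import org.apache.camel.Route;
import org.apache.camel.impl.RoutePolicySupport;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class StopOnFailurePolicy extends RoutePolicySupport {
    private static final Logger LOG = LoggerFactory.getLogger(StopOnFailurePolicy.class);

    @Override
    public void onExchangeDone(final Route route, final Exchange exchange) {
        if (!exchange.isFailed()) {
            return;
        }
        Exception cause = exchange.getProperty(Exchange.EXCEPTION_CAUGHT, Exception.class);
        LOG.error("stopping route " + route.getId() + " after failure", cause);
        // stop asynchronously: this callback still runs on the route's own thread
        new Thread(new Runnable() {
            public void run() {
                try {
                    exchange.getContext().stopRoute(route.getId());
                } catch (Exception e) {
                    LOG.error("failed to stop route " + route.getId(), e);
                }
            }
        }).start();
    }
}
```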
