OpenDaylight Boron: Config shard not getting created and Circuit Breaker Timed out - karaf

We are using ODL Boron SR2. We observe a strange behavior where the "Config" shard is not created when we start ODL in cluster mode on RHEL 6.9, and we see a Circuit Breaker Timed out exception. The "Operational" shard, however, is created without any issues. Because the "Config" shard is unavailable, we are unable to persist anything in the Config tree. We checked the JMX console and "Shards" is missing.
This is consistently reproducible on RHEL, but it works on CentOS.
2018-04-04 08:00:38,396 | WARN | saction-29-31'}} | 168 - org.opendaylight.controller.config-manager - 0.5.2.Boron-SR2 | DeadlockMonitor$DeadlockMonitorRunnable | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 26697 ms
2018-04-04 08:00:40,690 | ERROR | lt-dispatcher-30 | 216 - com.typesafe.akka.slf4j - 2.4.7 | Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$1 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.UpdateElectionTerm] with sequence number [4] for persistenceId [member-2-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.

This is an issue with akka persistence where it times out trying to write to the disk. See the discussion in https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013781.html.
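One mitigation discussed in that thread is to relax the Akka persistence circuit breaker so that slow disk writes do not trip it. A hedged sketch of the relevant settings (the file location and values are illustrative; in Boron the Akka configuration typically lives in configuration/initial/akka.conf, and a matching block exists under snapshot-store-plugin-fallback):
akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 60s   # default is 10s; raise this if the disk is slow
        reset-timeout = 30s
      }
    }
  }
}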

Related

SQL Server NEWSEQUENTIALID() - clarification for super fast .net core implementation

Currently I'm trying to reimplement SQL Server's NEWSEQUENTIALID() in .NET Core 2.2. It should run really fast and allocate the minimum possible amount of memory, but I need clarification on how to calculate the UUID version and where to place it (which byte it goes into, or what bit shift is needed). So far I have generated the timestamp, retrieved the MAC address, and copied bytes 8 and 9 from a base randomly generated GUID, but I'm surely missing something because the results don't match the output of the original algorithm.
var guidArray = new byte[16];
// mac
guidArray[15] = macBytes[5];
guidArray[14] = macBytes[4];
guidArray[13] = macBytes[3];
guidArray[12] = macBytes[2];
guidArray[11] = macBytes[1];
guidArray[10] = macBytes[0];
// base guid
guidArray[9] = baseGuidBytes[9];
guidArray[8] = baseGuidBytes[8];
// time
guidArray[7] = ticksDiffBytes[0];
guidArray[6] = ticksDiffBytes[1];
guidArray[5] = ticksDiffBytes[2];
guidArray[4] = ticksDiffBytes[3];
guidArray[3] = ticksDiffBytes[4];
guidArray[2] = ticksDiffBytes[5];
guidArray[1] = ticksDiffBytes[6];
guidArray[0] = ticksDiffBytes[7];
var guid = new Guid(guidArray);
Current benchmark results:
| Method                     |      Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------------- |----------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
| SqlServerNewSequentialGuid |  37.31 ns | 0.680 ns | 0.636 ns |  1.00 |    0.00 | 0.0127 |     - |     - |      80 B |
| Guid_Standard              |  63.29 ns | 0.435 ns | 0.386 ns |  1.70 |    0.03 |      - |     - |     - |         - |
| Guid_Comb                  | 299.57 ns | 2.902 ns | 2.715 ns |  8.03 |    0.13 | 0.0162 |     - |     - |     104 B |
| Guid_Comb_New              | 266.92 ns | 3.173 ns | 2.813 ns |  7.16 |    0.11 | 0.0162 |     - |     - |     104 B |
| MyFastGuid                 |  70.08 ns | 1.011 ns | 0.946 ns |  1.88 |    0.05 | 0.0050 |     - |     - |      32 B |
Update:
Here are the latest results of benchmarking common id generators written in .net core.
As you can see, my implementation NewSequentialGuid_PureNetCore performs at most 2x worse than the wrapper around rpcrt4.dll (which was my baseline), but my implementation allocates less memory (30 B).
Here is a sequence of the first 10 sample GUIDs:
492bea01-456f-3166-0001-e0d55e8cb96a
492bea01-456f-37a5-0002-e0d55e8cb96a
492bea01-456f-aca5-0003-e0d55e8cb96a
492bea01-456f-bba5-0004-e0d55e8cb96a
492bea01-456f-c5a5-0005-e0d55e8cb96a
492bea01-456f-cea5-0006-e0d55e8cb96a
492bea01-456f-d7a5-0007-e0d55e8cb96a
492bea01-456f-dfa5-0008-e0d55e8cb96a
492bea01-456f-e8a5-0009-e0d55e8cb96a
492bea01-456f-f1a5-000a-e0d55e8cb96a
If you want the code, give me a sign ;)
The official documentation states it quite clearly:
NEWSEQUENTIALID is a wrapper over the Windows UuidCreateSequential
function, with some byte shuffling applied.
There are also links in the quoted paragraph which might be of interest to you. However, considering that the original code is written in C / C++, I somehow doubt that .NET can outperform it, so reusing the same approach might be a more prudent choice (even though it would involve unmanaged calls).
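A minimal sketch of that approach, assuming Windows and the documented rpcrt4.dll entry point (error handling simplified; SQL Server additionally shuffles bytes before storing the value):
using System;
using System.Runtime.InteropServices;

static class NativeSequentialGuid
{
    // The same Windows API that NEWSEQUENTIALID wraps.
    [DllImport("rpcrt4.dll")]
    private static extern int UuidCreateSequential(out Guid guid);

    public static Guid Next()
    {
        int status = UuidCreateSequential(out Guid g);
        // 0 = RPC_S_OK; 1824 = RPC_S_UUID_LOCAL_ONLY (no MAC address available, GUID still usable)
        if (status != 0 && status != 1824)
            throw new InvalidOperationException($"UuidCreateSequential returned {status}");
        return g;
    }
}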
Having said that, I sincerely hope that you have researched the behaviour of this function and considered all its side effects before deciding to pursue this approach. And I certainly hope you aren't going to use this output as a clustered index for your table(s). The reason for this is also mentioned in the docs (as a warning, no less):
The UuidCreateSequential function has hardware dependencies. On SQL
Server, clusters of sequential values can develop when databases (such
as contained databases) are moved to other computers. When using
Always On and on SQL Database, clusters of sequential values can
develop if the database fails over to a different computer.
Basically, the function generates a monotonic sequence only while the database stays in the same hosting environment. When:
a network card gets changed on the bare metal (or whatever else the function depends upon), or
a backup is restored someplace else (think Prod-to-Dev refresh, or simply prod migration / upgrade), or
a failover happens, whether in a cluster or in an AlwaysOn configuration
, the new SQL Server instance will have its own range of generated values, which is supposed not to overlap the ranges of other instances on other machines. If that new range comes "before" the existing values, you'll end up with fragmentation issues for absolutely no good reason. Oh, and top (1) to get the latest value won't work anymore.
Indeed, if all you need is a non-exhaustible monotonic sequence, follow Greg Low's advice and just stick to bigint. It's half as wide, and no, you can't possibly exhaust it.
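Regarding the version and variant bits you asked about: a time-based UUID is version 1, and RFC 4122 puts the version nibble in the high 4 bits of time_hi_and_version and the variant in the top bits of clock_seq_hi_and_reserved. Assuming you keep the byte order of the Guid(byte[]) constructor (Data1..Data3 little-endian), that maps to indices 7 and 8 of the 16-byte array:
// version 1 (time-based) in the high nibble of the high byte of Data3
guidArray[7] = (byte)((guidArray[7] & 0x0F) | 0x10);
// RFC 4122 variant (binary 10xx) in the top bits of Data4[0]
guidArray[8] = (byte)((guidArray[8] & 0x3F) | 0x80);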

PostgreSQL + pgpool replication with mis-balancing

I have a PostgreSQL master-slave replication setup with pgpool as a load balancer on the master server only. The replication is going OK and there is no delay in the process. The problem is that the master server is receiving more requests than the slave, even when I have configured a balance different from 50% for each server.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(2):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.333333 | primary | 56348331 | false | 0
1 | slave-ip | 9999 | up | 0.666667 | standby | 3691734 | true | 0
As you can see, the master server is receiving more than 10x the requests of the slave.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(5):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.166667 | primary | 10542201 | false | 0
1 | slave-ip | 9999 | up | 0.833333 | standby | 849494 | true | 0
The behavior is quite similar when I assign M(1)-S(1).
Now I wonder whether I misunderstood how pgpool works:
Pgpool only balances read queries (write queries are always sent to the master).
The backend weight parameter is used to calculate the distribution only in load balancing mode. The greater the value, the more likely the node is to be chosen by pgpool, so a server with a greater lb_weight should be selected more often than servers with lower values.
If I'm right, why is this happening?
Is there a way I can actually achieve a proper balance of select_cnt queries? My intention is to load the slave with read queries and leave the master only a "few" read queries, since it is handling all the writes.
You are right about pgpool load balancing. There could be some reasons why this doesn't seem to work. For a start, notice that you have the same port number for both backends. Try configuring your backend connection settings as shown in the sample pgpool.conf: https://github.com/pgpool/pgpool2/blob/master/src/sample/pgpool.conf.sample (lines 66-87), where you also set the weights to your needs, and assign different port numbers to each backend.
Also check (assuming your running mode is master/slave):
load_balance_mode = on
master_slave_mode = on
-- changes require restart
There is a relevant FAQ entry, "It seems my pgpool-II does not do load balancing. Why?", here: https://www.pgpool.net/mediawiki/index.php/FAQ (if your pgpool version is 4.1, also consider statement_level_load_balance). So far, I have assumed that the general conditions for load balancing (https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html) are met.
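For example, a two-backend setup with a 1:2 weighting might look like this in pgpool.conf (hostnames and data directories are placeholders; the backend ports are the PostgreSQL ports on each host, not pgpool's 9999):
backend_hostname0 = 'master-ip'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'

backend_hostname1 = 'slave-ip'
backend_port1 = 5432
backend_weight1 = 2
backend_data_directory1 = '/var/lib/pgsql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'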
You can try to adjust the configuration below in the pgpool.conf file:
1. WAL lag delay size
delay_threshold = 10000000
This is used to let pgpool know whether the slave's PostgreSQL WAL is lagging too far behind to be used. Making it larger lets more queries be passed to the slave; making it smaller sends more queries to the master.
Besides, the pgbench testing parameters are also key. Use the -C parameter so that a new connection is opened per transaction; otherwise one connection is used per session.
pgpool's load balancing decision depends on a combination of parameters, not just a single one.
Here is a reference:
https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html#GUC-LOAD-BALANCE-MODE
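For instance, a select-only pgbench run through pgpool that opens a new connection per transaction (host, port and database name are placeholders):
pgbench -h pgpool-host -p 9999 -c 10 -T 60 -S -C testdb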

Is there a delay between a SET and a GET with the same key in Redis?

I have three processes on one computer:
A test (T)
A nginx server with my own module (M) --- the test starts and stops this process between each test case section
A Redis server (R), which is always running --- the test does not handle the start/stop sequence of this service (I'm testing my nginx module, not Redis.)
Here is a diagram of the various events:
T M R
| | |
O-------->+ FLUSHDB
| | |
+<--------O (FLUSHDB acknowledge as successful)
| | |
O-------->+ SET key value
| | |
+<--------O (SET acknowledge as successful)
| | |
O--->+ | Start nginx including my module
| | |
| O--->+ GET key
| | |
| +<---O (SUCCESS 80% and FAILURE 20%)
| | |
The test clears the Redis database with FLUSHDB, then adds a key with SET key value. The test then starts nginx, including my module. There, once in a while, the nginx module's GET key action fails.
Note 1: I am not using the async implementation of Redis.
Note 2: I am using the C library hiredis.
Is it possible that there is a delay between a SET and a following GET with the same key, which would explain why this process fails once in a while? Is there a way for me to ensure that the SET is really done once the redisCommand() function returns?
IMPORTANT NOTE: if I run one such test and the GET fails in my nginx module, the key appears in my Redis:
redis-cli
127.0.0.1:6379> KEYS *
1) "8b95d48d13e379f1ccbcdfc39fee4acc5523a"
127.0.0.1:6379> GET "8b95d48d13e379f1ccbcdfc39fee4acc5523a"
"the expected value"
So the
SET "8b95d48d13e379f1ccbcdfc39fee4acc5523a" "the expected value"
worked as expected. Only the GET failed and I would assume that it is because it somehow occurred too quickly. Any idea how to tackle this problem?
No, there is no delay between set and get. What you are doing should work.
Try running the MONITOR command in a separate window. When it fails, does the SET command come before or after the GET command?
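You can also confirm the SET from the reply hiredis hands back: redisCommand is synchronous, so an "OK" status reply means the key is already set on the server before the call returns. A rough sketch (connection setup assumed to exist already):
#include <string.h>
#include <hiredis/hiredis.h>

/* Returns 1 if the SET was acknowledged by the server, 0 otherwise. */
static int set_and_confirm(redisContext *ctx, const char *key, const char *value)
{
    redisReply *reply = redisCommand(ctx, "SET %s %s", key, value);
    if (reply == NULL) {
        /* connection-level failure: inspect ctx->err and ctx->errstr */
        return 0;
    }
    int ok = (reply->type == REDIS_REPLY_STATUS && strcmp(reply->str, "OK") == 0);
    freeReplyObject(reply);
    /* when ok == 1, the key is visible to every client from this point on */
    return ok;
}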

OpenDaylight shards are missing in Nitrogen SR3 load

We are using the Nitrogen SR3 package and we have a customized 2-node cluster. After soaking the cluster for 2 or 3 days, we noticed a NoShardLeaderException. We also checked through JMX and noticed that the "default-config" and "default-operational" shards in the distributed data store don't exist.
Can you please let us know the possible reasons for the shards suddenly going missing?
Update:
The following exceptions are noticed in karaf.log:
2018-07-09 12:31:11,000 | ERROR | lt-dispatcher-15 | 219 - com.typesafe.akka.slf4j - 2.4.7 | Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$1 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [13058670] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2018-07-19 02:03:14,687 | WARN | t-dispatcher-172 | 505 - org.opendaylight.controller.sal-distributed-datastore - 1.6.3 | ActorContext$4 | broadcast failed to send message CloseTransactionChain to shard default: {}
org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-2-shard-default-config currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.createNoShardLeaderException(ShardManager.java:955)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.onShardNotInitializedTimeout(ShardManager.java:787)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.handleCommand(ShardManager.java:254)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)[498:org.opendaylight.controller.sal-clustering-commons:1.6.3]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[322:com.typesafe.akka.persistence:2.4.20]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[498:org.opendaylight.controller.sal-clustering-commons:1.6.3]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[317:com.typesafe.akka.actor:2.4.20]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[317:com.typesafe.akka.actor:2.4.20]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[322:com.typesafe.akka.persistence:2.4.20]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[317:com.typesafe.akka.actor:2.4.20]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[317:com.typesafe.akka.actor:2.4.20]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
It sounds like the Shard actor threw an exception and akka killed the actor. Look for exceptions in the log.
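For example, searching the rolled karaf logs for the first stack trace around the failure often shows what killed the Shard actor (the path assumes a default Karaf layout):
grep -n -i -B 2 -A 20 "Circuit Breaker Timed out" data/log/karaf.log*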

Is it possible to repair label data in Neo4j?

At some point, based on the logs, my Neo4j server got stopped up and was behaving irrationally until I restarted it. I don't have logs showing what caused the issue, but the shutdown sequence dumped several exceptions into the logs, mainly:
2015-01-29 22:10:04.204+0000 INFO [API] Neo4j Server shutdown initiated by request
...
22:10:20.911 [Thread-13] WARN o.e.j.util.thread.QueuedThreadPool - qtp313783031{STOPPING,2<=1<=20,i=0,q=0} Couldn't stop Thread[qtp313783031-22088,5,main]
2015-01-29 22:10:20.923+0000 INFO [API] Successfully shutdown Neo4j Server.
2015-01-29 22:10:20.936+0000 ERROR [org.neo4j]: Exception when stopping org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource#68698d61 Failed to flush file channel /var/lib/neo4j/data/graph.db/neostore.propertystore.db
org.neo4j.kernel.impl.nioneo.store.UnderlyingStorageException: Failed to flush file channel /var/lib/neo4j/data/graph.db/neostore.propertystore.db
... stack dump of exception ...
During the three-day period before this restart, some nodes and relationships were created that can be found in the system but don't have a properly functioning label attached to them. See the queries below for an example of the enigma.
neo4j-sh (?)$ match (b:Book {book_id: 25937}) return b;
+---+
| b |
+---+
+---+
0 row
105 ms
neo4j-sh (?)$ match (b {book_id: 25937}) return b;
+-------------------------------------------------------------+
| b |
+-------------------------------------------------------------+
| Node[97574]{book_id:25937,title:"Writing an autobiography"} |
+-------------------------------------------------------------+
1 row
189 ms
neo4j-sh (?)$ match (b {book_id: 25937}) return id(b), labels(b);
+----------------------+
| id(b) | labels(b) |
+----------------------+
| 97574 | ["Book"] |
+----------------------+
1 row
165 ms
Even if I then explicitly add the label and query for :Book again, it still doesn't return.
neo4j-sh (?)$ match (b {book_id: 25937}) set b :Book return b;
+-------------------------------------------------------------+
| b |
+-------------------------------------------------------------+
| Node[97574]{book_id:25937,title:"Writing an autobiography"} |
+-------------------------------------------------------------+
1 row
143 ms
neo4j-sh (?)$ match (b:Book {book_id: 25937}) return b;
+---+
| b |
+---+
+---+
0 row
48 ms
Also not helpful is dropping and recreating the index(es) I have on these nodes.
The fact that these labels aren't functioning properly has caused my code to begin creating second, "duplicate" nodes (exactly the same except with working labels) due to my on-demand architecture.
I'm starting to believe these nodes are never going to work. Before I dive into reproducing ALL the nodes and relationships created during that three day time span, is it possible to repair what's existing?
-- EDIT --
It seems that dropping and recreating the index DOES solve this problem. I was testing immediately after dropping and immediately after recreating, forgetting that the indexes may take time to build. The system was still misbehaving while operating with NO index, so I thought it wasn't working at all.
TL;DR: Try recreating the index on the broken nodes.
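For example, in neo4j-shell on Neo4j 2.x you can recreate the index and wait for it to finish populating before re-testing ("schema await" blocks until indexes are online); the label and property here match the ones from the question:
neo4j-sh (?)$ drop index on :Book(book_id);
neo4j-sh (?)$ create index on :Book(book_id);
neo4j-sh (?)$ schema await
neo4j-sh (?)$ match (b:Book {book_id: 25937}) return b;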
