OpenDaylight shards are missing in Nitrogen SR3 load - Karaf

We are using the Nitrogen SR3 package with a customized 2-node cluster. After soaking the cluster for 2 or 3 days, we noticed a NoShardLeaderException. We also checked through JMX and found that the "default-config" and "default-operational" shards in the distributed data store no longer exist.
Can you please let us know the possible reasons for the shards suddenly going missing?
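For reference, this is roughly how we check the shard MBean over the Jolokia endpoint (the member and shard names below are from our member-1 node; the exact ObjectName may differ per release):
curl -u admin:admin 'http://localhost:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore'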
Update: the following exceptions are noticed in karaf.log:
2018-07-09 12:31:11,000 | ERROR | lt-dispatcher-15 | 219 - com.typesafe.akka.slf4j - 2.4.7 | Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$1 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [13058670] for persistenceId [member-1-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2018-07-19 02:03:14,687 | WARN | t-dispatcher-172 | 505 - org.opendaylight.controller.sal-distributed-datastore - 1.6.3 | ActorContext$4 | broadcast failed to send message CloseTransactionChain to shard default: {}
org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-2-shard-default-config currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.createNoShardLeaderException(ShardManager.java:955)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.onShardNotInitializedTimeout(ShardManager.java:787)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.handleCommand(ShardManager.java:254)[505:org.opendaylight.controller.sal-distributed-datastore:1.6.3]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)[498:org.opendaylight.controller.sal-clustering-commons:1.6.3]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[322:com.typesafe.akka.persistence:2.4.20]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[498:org.opendaylight.controller.sal-clustering-commons:1.6.3]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[317:com.typesafe.akka.actor:2.4.20]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[317:com.typesafe.akka.actor:2.4.20]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[322:com.typesafe.akka.persistence:2.4.20]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[322:com.typesafe.akka.persistence:2.4.20]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[317:com.typesafe.akka.actor:2.4.20]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[317:com.typesafe.akka.actor:2.4.20]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[317:com.typesafe.akka.actor:2.4.20]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[616:org.scala-lang.scala-library:2.11.12.v20171031-225310-b8155a5502]

It sounds like the Shard actor threw an exception and Akka killed the actor. Look for exceptions in the log.
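For example, a quick way to scan for those (assuming the default Karaf log location under the distribution's data/log directory):
grep -iE "error|exception" data/log/karaf.log | grep -i shard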

Related

Too many languages in Solr config

We have a Solr configuration based on Apache Solr 8.5.2.
We use the installation from the TYPO3 extension ext:solr 10.0.3.
As a result we have multiple (39) languages and multiple cores.
Since we do not need most of the languages (we definitely need one, maybe two more), I tried to remove them by deleting (moving to another folder) all the configurations I identified as belonging to other languages, leaving only these folders and files in the Solr directories:
server/
+-solr/
| +-configsets/
| | +-ext_solr_10_0_0/
| | +-conf/
| | | +-english/
| | | +-_schema_analysis_stopwords_english.json
| | | +-admin-extra.html
| | | :
| | | +-solrconfig.xml
| | +-typo3lib
| | +-solr-typo3-plugin-4.0.0.jar
| +cores/
| | +-english/
| | +-core.properties
| +-data/
| | +-english/
: : :
I thought that after restarting the server it would only present one language and one core. This was correct, but on startup it reported all the other languages as missing, for example:
core_es: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core core_es: Error loading schema resource spanish/schema.xml
Where does solr get this information about all these languages I don't need?
How can I avoid this long list of warnings?
First of all, it does not hurt to have those cores. As long as they are empty and not loaded, they do not take much RAM and CPU.
But if you still want to get rid of them, you need to do it correctly. Just moving a core's data directory does not delete it, because the Solr server also needs to adjust its config files. The best way is to use curl, like this:
curl 'http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core_es&deleteInstanceDir=true'
That would remove the core and all its data.
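If you are not sure which cores the server still has registered, the CoreAdmin STATUS action lists them (default host and port assumed here):
curl 'http://localhost:8983/solr/admin/cores?action=STATUS'
Run the UNLOAD call above once for each unwanted core reported there.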

PostgreSQL + pgpool replication with mis-balanced load

I have a PostgreSQL master-slave replication setup with pgpool as a load balancer on the master server only. Replication is working fine and there is no delay. The problem is that the master server receives more requests than the slave, even though I have configured a balance other than 50% for each server.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(2):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.333333 | primary | 56348331 | false | 0
1 | slave-ip | 9999 | up | 0.666667 | standby | 3691734 | true | 0
As you can see, the master server is receiving more than 10x the requests of the slave.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(5):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.166667 | primary | 10542201 | false | 0
1 | slave-ip | 9999 | up | 0.833333 | standby | 849494 | true | 0
The behavior is quite similar when I assign M(1)-S(1).
Now I wonder if I misunderstood how pgpool works:
Pgpool only balances read queries (write queries are always sent to the master).
The backend weight parameter is used to calculate the distribution only in load-balancing mode. The greater the value, the more likely pgpool is to choose that node, so a server with a greater lb_weight should be selected more often than servers with lower values.
If I'm right, why is this happening?
Is there a way to actually get a proper balance of select_cnt between the nodes? My intention is to load the slave with read queries and leave the master only a few reads, since it is handling all the writes.
You are right about pgpool load balancing. There could be several reasons why this doesn't seem to work. For a start, notice that you have the same port number for both backends. Try configuring your backend connection settings as shown in the sample pgpool.conf: https://github.com/pgpool/pgpool2/blob/master/src/sample/pgpool.conf.sample (lines 66-87), where you also set the weights to your needs, and assign different port numbers to each backend.
Also check (assuming your running mode is master/slave):
load_balance_mode = on
master_slave_mode = on
-- changes require restart
There is a relevant FAQ entry, "It seems my pgpool-II does not do load balancing. Why?", here: https://www.pgpool.net/mediawiki/index.php/FAQ (if you are on pgpool version 4.1, also consider statement_level_load_balance). So far I have assumed that the general conditions for load balancing (https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html) are met.
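As a minimal sketch, assuming PostgreSQL listens on 5432 on both machines and using the hostnames from your show_pool_nodes output, the backend section could look like this (paths are placeholders for your environment):
backend_hostname0 = 'master-ip'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'slave-ip'
backend_port1 = 5432
backend_weight1 = 5
backend_data_directory1 = '/var/lib/pgsql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'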
You can also try adjusting the following setting in pgpool.conf:
1. WAL lag delay size
delay_threshold = 10000000
This tells pgpool whether the standby's WAL replay is lagging too far behind to use. A larger value lets more queries go to the slave; a smaller value sends more queries to the master.
Besides, the pgbench test parameters also matter. Use the -C option so that pgbench opens a new connection for each transaction; otherwise it uses one connection per session. pgpool's load-balancing decision depends on a combination of parameters, not on a single parameter alone.
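For example, a select-only pgbench run that reconnects for each transaction (database name and pgpool host below are placeholders) would be:
pgbench -C -S -c 10 -T 60 -h <pgpool-host> -p 9999 yourdb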
Here is a reference:
https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html#GUC-LOAD-BALANCE-MODE

Is there a delay between a SET and a GET with the same key in Redis?

I have three processes on one computer:
A test (T)
An nginx server with my own module (M) --- the test starts and stops this process between each test case section
A Redis server (R), which is always running --- the test does not handle the start/stop sequence of this service (I'm testing my nginx module, not Redis.)
Here is a diagram of the various events:
T M R
| | |
O-------->+ FLUSHDB
| | |
+<--------O (FLUSHDB acknowledge as successful)
| | |
O-------->+ SET key value
| | |
+<--------O (SET acknowledge as successful)
| | |
O--->+ | Start nginx including my module
| | |
| O--->+ GET key
| | |
| +<---O (SUCCESS 80% and FAILURE 20%)
| | |
The test clears the Redis database with FLUSHDB, then adds a key with SET key value. The test then starts nginx, including my module. Once in a while, the GET key performed by the nginx module fails.
Note 1: I am not using the ASync implementation of Redis.
Note 2: I am using the C library hiredis.
Is it possible that there is a delay between a SET and a following GET with the same key, which would explain why this process fails once in a while? Is there a way for me to ensure that the SET has really completed once the redisCommand() function returns?
IMPORTANT NOTE: if I run one such test and the GET fails in my nginx module, the key appears in my Redis:
redis-cli
127.0.0.1:6379> KEYS *
1) "8b95d48d13e379f1ccbcdfc39fee4acc5523a"
127.0.0.1:6379> GET "8b95d48d13e379f1ccbcdfc39fee4acc5523a"
"the expected value"
So the
SET "8b95d48d13e379f1ccbcdfc39fee4acc5523a" "the expected value"
worked as expected. Only the GET failed, and I assume it is because it somehow occurred too quickly. Any idea how to tackle this problem?
No, there is no delay between SET and GET. What you are doing should work.
Try running the MONITOR command in a separate window. When the GET fails, does the SET command appear before or after the GET?
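For example, assuming Redis on the default local port, run this in a separate terminal while the test executes and check the order of the commands for that key:
redis-cli monitor | grep 8b95d48d13e379f1ccbcdfc39fee4acc5523a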

OpenDaylight Boron: Config shard not getting created and Circuit Breaker timed out

We are using ODL Boron SR2. We observe strange behavior: the "Config" shard is not created when we start ODL in cluster mode on RHEL 6.9, and we see a Circuit Breaker Timed Out exception. The "Operational" shard, however, is created without any issues. Because the "Config" shard is unavailable, we are unable to persist anything in the Config tree. We checked the JMX console and "Shards" is missing.
This is consistently reproducible on RHEL; however, it works on CentOS.
2018-04-04 08:00:38,396 | WARN | saction-29-31'}} | 168 - org.opendaylight.controller.config-manager - 0.5.2.Boron-SR2 | DeadlockMonitor$DeadlockMonitorRunnable | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 26697 ms
2018-04-04 08:00:40,690 | ERROR | lt-dispatcher-30 | 216 - com.typesafe.akka.slf4j - 2.4.7 | Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$1 | Failed to persist event type [org.opendaylight.controller.cluster.raft.persisted.UpdateElectionTerm] with sequence number [4] for persistenceId [member-2-shard-default-config].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
This is an issue with Akka persistence timing out while trying to write to disk. See the discussion in https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013781.html.
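If the underlying disk is simply slow, one possible workaround (a sketch, not a recommendation from the linked thread) is to raise the Akka persistence circuit-breaker timeout in the member's Akka configuration (configuration/initial/akka.conf in Boron); the values below are illustrative only:
akka.persistence.journal-plugin-fallback.circuit-breaker {
  max-failures = 10
  call-timeout = 30s
  reset-timeout = 60s
}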

Is it possible to repair label data in Neo4j?

At some point, based on the logs, my Neo4j server got stuck and was behaving erratically until I restarted it. I don't have logs showing what caused the issue, but the shutdown sequence dumped several exceptions into the logs, mainly:
2015-01-29 22:10:04.204+0000 INFO [API] Neo4j Server shutdown initiated by request
...
22:10:20.911 [Thread-13] WARN o.e.j.util.thread.QueuedThreadPool - qtp313783031{STOPPING,2<=1<=20,i=0,q=0} Couldn't stop Thread[qtp313783031-22088,5,main]
2015-01-29 22:10:20.923+0000 INFO [API] Successfully shutdown Neo4j Server.
2015-01-29 22:10:20.936+0000 ERROR [org.neo4j]: Exception when stopping org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource#68698d61 Failed to flush file channel /var/lib/neo4j/data/graph.db/neostore.propertystore.db
org.neo4j.kernel.impl.nioneo.store.UnderlyingStorageException: Failed to flush file channel /var/lib/neo4j/data/graph.db/neostore.propertystore.db
... stack dump of exception ...
During the three-day period before this restart, some nodes and relationships were created that can be found in the system but don't have a properly functioning label attached to them. See the queries below for an example of the enigma.
neo4j-sh (?)$ match (b:Book {book_id: 25937}) return b;
+---+
| b |
+---+
+---+
0 row
105 ms
neo4j-sh (?)$ match (b {book_id: 25937}) return b;
+-------------------------------------------------------------+
| b |
+-------------------------------------------------------------+
| Node[97574]{book_id:25937,title:"Writing an autobiography"} |
+-------------------------------------------------------------+
1 row
189 ms
neo4j-sh (?)$ match (b {book_id: 25937}) return id(b), labels(b);
+----------------------+
| id(b) | labels(b) |
+----------------------+
| 97574 | ["Book"] |
+----------------------+
1 row
165 ms
Even if I then explicitly add the label and query for :Book again, it still doesn't return.
neo4j-sh (?)$ match (b {book_id: 25937}) set b :Book return b;
+-------------------------------------------------------------+
| b |
+-------------------------------------------------------------+
| Node[97574]{book_id:25937,title:"Writing an autobiography"} |
+-------------------------------------------------------------+
1 row
143 ms
neo4j-sh (?)$ match (b:Book {book_id: 25937}) return b;
+---+
| b |
+---+
+---+
0 row
48 ms
Also not helpful is dropping and recreating the index(es) I have on these nodes.
The fact that these labels aren't functioning properly has caused my code to begin creating second, "duplicate" nodes (exactly the same except with working labels) due to my on-demand architecture.
I'm starting to believe these nodes are never going to work. Before I dive into reproducing ALL the nodes and relationships created during that three day time span, is it possible to repair what's existing?
-- EDIT --
It seems that dropping and recreating the index DOES solve this problem. I was testing immediately after dropping and immediately after recreating, forgetting that the indexes may take time to build. The system was still misbehaving while operating with NO index, so I thought it wasn't working at all.
TL;DR: Try dropping and recreating the index on the broken nodes.
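For the 2.x-era Cypher shown above, that would look roughly like this (label and property taken from the question; give the new index time to come online before re-testing):
DROP INDEX ON :Book(book_id);
CREATE INDEX ON :Book(book_id);
In neo4j-shell, the schema command shows when the index has finished populating and is ONLINE.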
