Cassandra nodes out of sync - NTP out-of-sync issue

We have a Cassandra cluster of 4 nodes, and it was working perfectly. After 2 of the nodes got restarted (since they were LXCs on the same machine), those 2 nodes are not able to join the cluster and fail with the error:
ERROR [MigrationStage:1] 2014-07-06 20:34:36,994 MigrationTask.java (line 55) Can't send migration
request: node /X.X.X.93 is down.
The two nodes that were not restarted show the restarted ones as DN in nodetool status, while the restarted ones show the others as UN.
I've checked the gossipinfo and that is fine.
Can anybody help me on this?

I suppose you have cross_node_timeout = true and the time between your servers is not in sync. You might want to check your NTP settings.
The restarted nodes might be dropping the data requests they receive from the other nodes, because with unsynchronised clocks those requests appear to have already timed out. Hence NTP should be configured on all Cassandra nodes.
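A quick way to verify this on each node (assuming ntpd as the time daemon; the cassandra.yaml path may differ on your install):
ntpq -p        # '*' marks the peer the node is actually syncing against
ntpstat        # reports whether the clock is synchronised and the current offset
grep cross_node_timeout /etc/cassandra/cassandra.yaml   # with unsynced clocks this setting makes replicas drop requests that look already expired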

Related

etcd DB cluster on kubernetes misbehaving

In my project we have an etcd DB deployed on Kubernetes on-prem (this etcd is for application use, separate from the Kubernetes etcd). I deployed it using the Bitnami Helm chart as a StatefulSet. Initially, at the time of deployment, the number of replicas was 1, as we only wanted a single etcd instance at first.
The real problem started when we scaled it up to 3. I updated the configuration to scale it up, adding the two new members' DNS names to ETCD_INITIAL_CLUSTER:
etcd-0=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380
Now when I go inside any of the etcd pods and run etcdctl member list, I only get a list of members and none of them is marked as leader, which is wrong. One of the three should be the leader.
Also, after running for some time, these pods start giving heartbeat-exceeded and server-overloaded errors:
W | etcdserver: failed to send out heartbeat on time (exceeded the 950ms timeout for 593.648512ms, to a9b7b8c4e027337a)
W | etcdserver: server is likely overloaded
W | wal: sync duration of 2.575790761s, expected less than 1s
I changed the heartbeat default value accordingly; the number of errors decreased, but I still get a few heartbeat-exceeded errors along with the others.
Not sure what the problem is here; is it I/O that's causing it? If so, I'm not sure how to confirm that.
Will really appreciate any help on this.
I don't think 🤔 the heartbeats are the main problem; it also seems 👀 that the logs you are seeing are warning logs. So it's possible that some heartbeats are missed here and there, but your node(s) are not crashing.
It's likely that you changed the replica count and your new replicas are not joining the cluster. So I would recommend following this guide to add the new members to the cluster. Basically, with etcdctl, something like this:
etcdctl member add node2 --peer-urls=http://node2:2380
etcdctl member add node3 --peer-urls=http://node3:2380
Note that you will have to run these commands in a pod that has access to all your etcd nodes in your cluster.
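A hedged way to get such a pod, assuming the bitnami/etcd image and the headless-service names from the question:
kubectl run etcd-client -n wallet --rm -it --restart=Never --image=bitnami/etcd --command -- \
  etcdctl --endpoints=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2379 member list
# run the member add commands the same way, pointed at any healthy endpoint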
You could also consider managing your etcd cluster with the etcd operator 🔧 which should be able to take care of the scaling and removal/addition of nodes.
✌️
Okay, I had two problems:
"failed to send out heartbeat" Warning messages.
"No leader election".
The next day I found out the reason for the second problem: I actually had this startup parameter set in the pod definition.
ETCDCTL_API: 3
So when I run "etcdctl member list" with API v3, it doesn't mention which member is elected as leader.
$ ETCDCTL_API=3 etcdctl member list
3d0bc1a46f81ecd9, started, etcd-2, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2379, false
b6a5d762d566708b, started, etcd-1, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2379, false
$ ETCDCTL_API=2 etcdctl member list
3d0bc1a46f81ecd9, started, etcd-2, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-2.etcd-headless.wallet.svc.cluster.local:2379, false
b6a5d762d566708b, started, etcd-1, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2380, http://etcd-1.etcd-headless.wallet.svc.cluster.local:2379, true
So when I use API v2 I can see which node is elected as leader, and there was no problem with leader election. Still working on the heartbeat warning, but I guess I need to tune the config in order to avoid it.
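(For what it's worth, API v3 can also show the leader, just not through member list; a quick check, assuming etcdctl v3 is available in the pod:)
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-0.etcd-headless.wallet.svc.cluster.local:2379 endpoint status --cluster -w table
# the IS LEADER column identifies the member currently holding leadership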
NB: I have 3 nodes, stopped one for testing.
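For the remaining heartbeat warnings, the knobs involved are etcd's heartbeat interval and election timeout; a sketch with illustrative values (set as flags or via their ETCD_* environment-variable equivalents in the chart values, and sized to your observed disk latency):
etcd --heartbeat-interval=500 --election-timeout=5000
# equivalently: ETCD_HEARTBEAT_INTERVAL=500 and ETCD_ELECTION_TIMEOUT=5000 in the pod env
# the "wal: sync duration ... expected less than 1s" warning points at slow disk I/O,
# so the storage class backing the PVCs is worth checking alongside these timeouts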

Issue with Collection backup and restore

I have a SolrCloud cluster of 2 nodes. The collection has a single shard with 2 replicas, one on each node.
The cores created are {collection_name}_shard1_replica1 and {collection_name}_shard1_replica2.
When I perform a collection backup and restore it into a new collection, the documents are indexed properly on both nodes. However, the cores created are named differently: {collection_name}_shard1_replica0 and {collection_name}_shard1_replica1.
Additionally, when I delete or add documents, the change is only applied on one node, which means replication is not working. I also noticed that the node where documents are not getting deleted or added has no index folder.
What could I be possibly doing wrong?
For all those interested in the solution: a sequential restart of all the nodes helped (I'm still not able to digest why it was required, and it is missing from the documentation).
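For reference, the backup and restore themselves go through the Collections API; a sketch with curl (host, collection names, and the shared location are placeholders, and the location must be accessible from every node):
curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=mybackup&collection=mycollection&location=/mnt/shared/backups'
curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=mybackup&collection=mycollection_restored&location=/mnt/shared/backups'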

Hibernate4, Grails 2.5 -- cached data persists between restarts?

I'm running into a strange issue with Hibernate4 caching in a Grails 2.5.0 application that is serving as a platform for data migrated from a legacy system. The migration involves direct database inserts and removals (while testing migration SQL) of database records. These operations are causing pageload errors in the system because cached data is different from the actual state of the database. Stacktrace errors on a particular failed page load indicate missing records whose IDs are not currently referenced by anything in the database via foreign key. For example, one page fails to render with the following error:
2018-02-27 10:16:32,495 http-bio-8080-exec-8 | ERROR StackTrace | superAdmin | Full Stack Trace:
org.hibernate.UnresolvableObjectException: No row with the given identifier exists: [com.tlc.worx.company.CompanyQuestion#48466]
at org.hibernate.UnresolvableObjectException.throwIfNull(UnresolvableObjectException.java:68)
at org.hibernate.event.internal.DefaultRefreshEventListener.onRefresh(DefaultRefreshEventListener.java:179)
at org.hibernate.event.internal.DefaultRefreshEventListener.onRefresh(DefaultRefreshEventListener.java:61)
at org.hibernate.internal.SessionImpl.fireRefresh(SessionImpl.java:1121)
at org.hibernate.internal.SessionImpl.refresh(SessionImpl.java:1094)
at org.hibernate.internal.SessionImpl.refresh(SessionImpl.java:1089)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate$10.doInHibernate(GrailsHibernateTemplate.java:342)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.doExecute(GrailsHibernateTemplate.java:188)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.refresh(GrailsHibernateTemplate.java:339)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.refresh(GrailsHibernateTemplate.java:335)
at org.codehaus.groovy.grails.orm.hibernate.HibernateGormInstanceApi.refresh(HibernateGormInstanceApi.groovy:150)
at com.tlc.worx.company.CompanyQuestion.refresh(CompanyQuestion.groovy)
at com.tlc.worx.company.CompanyQuestion$refresh.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:110)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:114)
at com.tlc.worx.checklist.CompanyQuestionController$_index_closure1$_closure2$_closure3.doCall(CompanyQuestionController.groovy:49)
at com.tlc.worx.checklist.CompanyQuestionController$_index_closure1$_closure2$_closure3.doCall(CompanyQuestionController.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
--
A search of the database for the record number being referenced reveals it only in a completely unrelated field and a Tally table:
(Note the page hitting the error does not have to do with the missing record itself, only an object that could be related to a CompanyQuestion but is currently NOT related to it).
I suspected a Hibernate caching issue, especially since this has coincided with the removal of records. Furthermore, migrating the same database to another environment for testing does not give rise to the same error on the new environment--corroborating my theory that this is related to environment-specific caching. But oddly, a Tomcat7 restart (app runs on Tomcat) within the original environment does not cause the problem to go away. Hibernate configuration is as follows:
hibernate {
    cache.use_second_level_cache = true
    cache.use_query_cache = false
    cache.region.factory_class = 'org.hibernate.cache.ehcache.SingletonEhCacheRegionFactory' // Hibernate 4
    singleSession = true // configure OSIV singleSession mode
    flush.mode = 'auto' // pre-Hibernate4 default behavior was auto, so we'll stick with that for now. See https://grails.org/2.4.3+Release+Notes
}
It's the restart not causing the issue to disappear that has me scratching my head--is this normal Hibernate behavior to cache things for eternity even between Tomcat restarts? Am I missing the mark entirely here? My next step is running the application in the first environment with the second level cache disabled, but I would like to get community feedback as well that I am at least on the right track regarding my theory--it seems crazy. Any recommendations/feedback appreciated!
Wanted to close this up as I eventually found the issue.
Our application utilises ElasticSearch to compile the data set that searches query. We have always reindexed ElasticSearch on app restart and had not run into this issue before. However, I learned that the reindex operation does not always do exactly what we want and can actually index old data as new records, leading either to duplicates or to a mixed bag of good and bad records.
The error in my question occurred when hitting one of the "bad", stale records from a previous reindex. Purging all ElasticSearch indices prior to reindexing resolved the issue.
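For anyone in the same spot, purging the indices can be a single delete call issued before triggering the application's reindex; a sketch, assuming a default local ElasticSearch endpoint:
curl -XDELETE 'http://localhost:9200/_all'
# removes every index so the subsequent reindex starts from a clean slate; use with care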

Intermittent connection timeouts to Solr server using SolrNet

I have a production webserver hosting a search, and another machine which hosts the Solr search server (on a subnet in the same room, so no network problems). All is fine more than 90% of the time, but I consistently get a small number of "The operation has timed out" errors.
I've increased the timeout in the SolrNet init to 30 seconds (!):
SolrNet.Startup.Init<SolrDataObject>(
    new SolrNet.Impl.SolrConnection(
        System.Configuration.ConfigurationManager.AppSettings["URL"]
    ) { Timeout = 30000 }
);
...but all that happened is that I started getting this message instead of the "Unable to connect to the remote server" I was seeing before. It made no difference to the number of timeout errors.
I can see nothing in any log (believe me I've looked!) and clearly my configuration is correct because it works most of the time. Anyone any ideas how I can find more information on this problem?
EDIT:
I have now increased the number of HttpRequest connections from 2 to 'a large number' (I see up to 10 connections) - but this has had no discernible effect on this problem.
The firewall is set to allow ANY connections between the two machines.
We've also checked the hardware with our server host and there are no problems on the connections, according to them.
EDIT 2:
We're still seeing this issue.
We're now logging the timeouts: they're mostly just over 30s, which is the SolrNet layer's timeout, but some are 20s, which is the Tomcat default timeout period. That suggests the problem is in the actual connection between the machines.
Not sure where to go from here, though - they're on a VLAN and we're specifically using the VLAN address - response time from pings is ALWAYS <1ms.
Without more information, I can only guess a few possible reasons:
You're fetching tons of documents per query, and it times out while transferring data.
You're hitting the ServicePoint.ConnectionLimit. If so, just increase this value. See also How can I programmatically remove the 2 connection limit in WebClient
You have some very facet-heavy requests or are misusing Solr (e.g. not using filter queries). Check the QTime in the response (quick check below). See the Solr performance wiki for more details.
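On the QTime point, a quick way to see it from the command line (host, port, and core name are placeholders):
curl 'http://solrhost:8983/solr/mycore/select?q=*:*&rows=0'
# responseHeader.QTime is Solr's internal query time in ms; if it stays low while the client
# still times out, the delay is in the connection or transfer rather than inside Solr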
Try setting this in .NET:
ServicePointManager.Expect100Continue = false;
or this:
ServicePointManager.SetTcpKeepAlive(true, 200000, 200000); - this sends TCP keep-alive probes to the server so the connection is not dropped as idle.

Drupal website blocked because of many connection errors - website goes offline

From time to time, the number of database connections from our Drupal 6.20 system to our MySQL database reaches 100-150, and after a while the website goes offline. The error message when trying to connect to MySQL manually is "blocked because of many connection errors. Unblock with 'mysqladmin flush-hosts'". Since the database is hosted on Amazon RDS I don't have permission to issue this command, but I can reboot the database, and once rebooted the website works normally again. Until next time.
Drupal reports multiple errors prior to going offline, of two types:
Duplicate entry '279890-0-all' for key 'PRIMARY' query: node_access_write_grants /* Guest : node_access_write_grants */ INSERT INTO node_access (nid, realm, gid, grant_view, grant_update, grant_delete) VALUES (279890, 'all', 0, 1, 0, 0) in /var/www/quadplex/drupal-6.20/modules/node/node.module on line 2267.
Lock wait timeout exceeded; try restarting transaction query: content_write_record /* Guest : content_write_record */ UPDATE content_field_rating SET vid = 503621, nid = 503621, field_rating_value = 1212 WHERE vid = 503621 in /var/www/quadplex/drupal-6.20/sites/all/modules/cck/content.module on line 1213.
The nids in these two queries are always the same and refer to two nodes that are frequently and automatically updated by a custom module. I can see a correlation between these errors and an unusually high number of web requests in the Apache logs. I would understand the website becoming slower because of this. But:
Why do these errors occur, and how can they be solved? It seems to me it's to do with several web requests trying to update the same node at the same time. But surely Drupal should deal with this by locking the tables etc? Or should I deal with it in some special way?
Despite the higher web load, why does the database lock up completely and require a reboot? Wouldn't it be better if the website still had access to MySQL, so that once the load drops it can serve pages again? Is there some setting for this?
Thank you!
This can often be resolved by checking one or all of these three things (a command sketch follows the list):
Are you out of disk space? From SSH, run df -h and make sure you still have free space.
Are the tables damaged? Repair the tables in phpMyAdmin, or via the CLI instructions here: http://dev.mysql.com/doc/refman/5.1/en/repair-table.html
Have you performance-tuned your MySQL via /etc/my.cnf? See this for more ideas: http://drupal.org/node/51263
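A hedged sketch of the corresponding checks against an RDS instance (endpoint, user, and database name are placeholders):
df -h                                                              # free disk space on the web server
mysql -h <rds-endpoint> -u <user> -p -e 'SHOW FULL PROCESSLIST'    # see what the 100-150 connections are doing
mysqlcheck -h <rds-endpoint> -u <user> -p --databases drupal       # check the Drupal tables
mysqlcheck -h <rds-endpoint> -u <user> -p --auto-repair --databases drupal   # repair any (MyISAM) tables that report problems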
