I'm using Solr 7.7 in a cluster. I have three prescale AWS EC2 servers behind a Target Group with an NLB attached, and autoscaling turned on just for the Solr cluster.
Using the command below, I created a Solr collection (collection name: abc) with 1 shard and 3 replicas. With autoscaling, the cluster now also includes autoscale servers, so collection abc currently has 4 replicas.
There are currently 4 active nodes in that cluster.
The application team will create a collection (collection name: xyz) from code. They do not know how many active nodes are in the Solr cluster. In the ZooKeeper properties, the replicationFactor value is set to 3.
The collection was created from the application side with a replication factor of 3 (one of the replicas is on an autoscale server). When the load decreases, the autoscale server shuts down.
At that point only 2 replicas of collection xyz remain, and the prescale servers do not automatically get a replica of collection xyz added.
Please provide a solution for this.
Create collection command:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_50&numShards=1&replicationFactor=3&collection.configName=conf_product&autoAddReplicas=true"
Below is the autoscaling policy I applied for Solr:
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-cluster-policy" : [{ "replica" : "1", "shard" : "#EACH", "node" : "#ANY", }] }'
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-trigger": {"name" : "node_added_trigger","event" : "nodeAdded","waitFor" : "5s", "preferredOperation": "ADDREPLICA", "enabled" : true, "actions" : [{ "name" : compute_plan", "class": "solr.ComputePlanAction" }, { "name" : "execute_plan", "class": "solr.ExecutePlanAction" } ] }}'
curl http://localhost:8983/solr/admin/autoscaling -H 'Content-type:application/json' -d '{ "set-trigger": { "name" : "node_lost_trigger", "event" : "nodeLost", "waitFor" : "5s", "preferredOperation": "DELETENODE", "enabled" : true, "actions" : [ { "name" : "compute_plan", "class": "solr.ComputePlanAction" }, { "name" : "execute_plan", "class": "solr.ExecutePlanAction" } ] }}'
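For reference, a missing replica of xyz can also be added back manually through the Collections API once a prescale node is available. This is only a sketch: the shard name and prescale node name below are assumptions, so substitute the values reported by CLUSTERSTATUS.
# Find the live node names first (the node parameter must match a live_nodes entry, e.g. host:8983_solr)
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=xyz"
# Add a replica of shard1 back onto a prescale node (shard and node values are assumptions)
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=xyz&shard=shard1&node=prescale-host-1:8983_solr"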
I had previously created a 3-node docker cluster of MongoDB with port 27017 mapped to that of respective hosts.
I had then created a replica set rs0 with its members being host1.mydomain.com:27017, host2.mydomain.com:27017 and host3.mydomain.com:27017. Please note that while creating the replica set, I had specified members with their mydomain.com addresses and not with ${IP1}:27017, etc. I had the respective DNS records set up for each host.
Thus, I could connect to this cluster with string:
mongodb+srv://admin:<pass>@host1.mydomain.com,host2.mydomain.com,host3.mydomain.com/admin?replicaSet=rs0
Unfortunately, I have lost access to mydomain.com as it has expired and has been scooped up by another buyer.
I can still SSH into the individual hosts, log into the docker containers, type mongo, then use admin; and then successfully authenticate using db.auth(<user>, <pass>). However, I cannot connect to the replica set, nor can I export the data out of it.
Here's what I get if I try to SSH into one of the nodes and try to access the data:
$ mongo
MongoDB shell version v3.6.8
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("fc3cf772-b437-47ab-8faf-5e0d16158ff0") }
MongoDB server version: 4.4.10
> use admin;
switched to db admin
> db.auth('admin', <pass>)
1
> show dbs;
2022-07-22T13:37:38.013+0000 E QUERY [thread1] Error: listDatabases failed:{
"topologyVersion" : {
"processId" : ObjectId("62da79de34490970182aacee"),
"counter" : NumberLong(1)
},
"ok" : 0,
"errmsg" : "not master and slaveOk=false",
"code" : 13435,
"codeName" : "NotPrimaryNoSecondaryOk"
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:67:1
shellHelper.show@src/mongo/shell/utils.js:860:19
shellHelper@src/mongo/shell/utils.js:750:15
@(shellhelp2):1:1
> rs.slaveOk();
> show dbs;
2022-07-22T13:38:04.016+0000 E QUERY [thread1] Error: listDatabases failed:{
"topologyVersion" : {
"processId" : ObjectId("62da79de34490970182aacee"),
"counter" : NumberLong(1)
},
"ok" : 0,
"errmsg" : "node is not in primary or recovering state",
"code" : 13436,
"codeName" : "NotPrimaryOrSecondary"
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:67:1
shellHelper.show@src/mongo/shell/utils.js:860:19
shellHelper@src/mongo/shell/utils.js:750:15
@(shellhelp2):1:1
How do I go about this? The DB contains important data that I would like to export or simply have the cluster (or one of the mongo hosts) running again.
Thanks!
Add the following records to the /etc/hosts file on each container running mongodb, and on the client you are connecting from:
xxx.xxx.xxx.xxx host1.mydomain.com
yyy.yyy.yyy.yyy host2.mydomain.com
zzz.zzz.zzz.zzz host3.mydomain.com
Replace xxx, yyy, zzz with the actual IP addresses that listen on port 27017.
If the client is Windows, the hosts file is located at %SystemRoot%\System32\drivers\etc\hosts
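Inside the containers this can be scripted, for example like below. This is only a sketch: the container names mongo1/mongo2/mongo3 are assumptions, and entries added this way are lost when a container is recreated, so you may prefer extra_hosts in your compose file.
# Append the host entries inside each mongodb container (container names are assumptions)
for c in mongo1 mongo2 mongo3; do
  docker exec "$c" sh -c '
    printf "%s\n" \
      "xxx.xxx.xxx.xxx host1.mydomain.com" \
      "yyy.yyy.yyy.yyy host2.mydomain.com" \
      "zzz.zzz.zzz.zzz host3.mydomain.com" >> /etc/hosts
  '
done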
If the replica set recovers, you will be able to connect to the database without the +srv scheme:
mongodb://admin:<pass>@host1.mydomain.com,host2.mydomain.com,host3.mydomain.com/ \
?authSource=admin&replicaSet=rs0
If you don't know the network configuration, or the replica set does not recover for any reason, you can still connect to the individual nodes as standalone instances.
Restart mongod without the --replSet parameter on the command line (somewhere in your Dockerfile) or without the replication section in mongod.conf. That will resolve the "NotPrimaryOrSecondary" error.
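Once a node is reachable again (either because the replica set recovered or because you restarted it as a standalone), the data can be exported with mongodump. A minimal sketch, assuming the default port and the admin credentials from the question; adjust the output path as needed:
# Dump all databases from the locally reachable node
mongodump --host 127.0.0.1 --port 27017 \
  --username admin --password '<pass>' --authenticationDatabase admin \
  --out /backup/mongodump-$(date +%F)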
I followed this example to stream data from mysql to elasticsearch:
https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
The example itself works great on my local machine.
But in my case I want to stream data from mssql (which is on another server, not docker) to elasticsearch.
So in the "docker-compose-es.yaml" file i removed "mysql" part and removed the mysql links.
And created my own connectors/sink for elastic and mssql:
{
  "name": "Test-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "192.168.1.234",
    "database.port": "1433",
    "database.user": "user",
    "database.password": "pass",
    "database.dbname": "Test",
    "database.server.name": "MyServer",
    "table.include.list": "dbo.TEST_A",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.testA"
  }
}
{
  "name": "elastic-sink-test",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "TEST_A",
    "connection.url": "http://localhost:9200/",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "SQ",
    "key.ignore": "false",
    "type.name": "TEST_A",
    "behavior.on.null.values": "delete"
  }
}
After adding these, the Kafka Connect I/O is working hard and has over 40 GB of input (see image below):
In the Kafka logs it looks like it is going through all the tables. Here is one of the table logs:
2021-06-17 10:20:10,414 - INFO [data-plane-kafka-request-handler-5:Logging#66] - [Partition MyServer.dbo.TemplateGroup-0 broker=1] Log loaded for partition MyServer.dbo.TemplateGroup-0 with initial high watermark 0
2021-06-17 10:20:10,509 - INFO [data-plane-kafka-request-handler-3:Logging#66] - Creating topic MyServer.dbo.TemplateMeter with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1))
2021-06-17 10:20:10,516 - INFO [data-plane-kafka-request-handler-3:Logging#66] - [KafkaApi-1] Auto creation of topic MyServer.dbo.TemplateMeter with 1 partitions and replication factor 1 is successful
2021-06-17 10:20:10,526 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(MyServer.dbo.TemplateMeter-0)
2021-06-17 10:20:10,528 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [Log partition=MyServer.dbo.TemplateMeter-0, dir=/kafka/data/1] Loading producer state till offset 0 with message format version 2
The database is only 2 GB. I'm not sure why the input is so high.
No test_a index shows up in Elasticsearch when I run this command:
curl http://localhost:9200/_aliases?pretty=true
Does anyone know how I can troubleshoot from here, or can someone point me in the right direction?
Thanks in advance!
how I troubleshoot from here
docker compose logs?
Modify the log4j.properties of the Kafka Connect and/or Elasticsearch processes to get more logs?
Use a regular Kafka consumer to see if data is actually read into the TEST_A topic? (See the sketch below.)
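For example, something along these lines run against the Kafka container from the compose file. This is a sketch only: the service name, the script locations inside the container, and the MyServer.dbo.TEST_A topic name are assumptions based on the Debezium example and the log output above.
# List the topics Debezium actually created (adjust the script path if your image differs)
docker compose -f docker-compose-es.yaml exec kafka \
  bin/kafka-topics.sh --bootstrap-server kafka:9092 --list
# Tail a few records from the table's change topic
docker compose -f docker-compose-es.yaml exec kafka \
  bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
  --topic MyServer.dbo.TEST_A --from-beginning --max-messages 5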
in the "docker-compose-es.yaml" ....
If Debezium is running in a container, then Elasticsearch is not available at localhost:9200
Change that value to http://elastic:9200, as shown in the es-sink.json
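For example, the sink's config can be re-posted with the in-network address via the Kafka Connect REST API. A sketch: 8083 is the default Connect REST port, and everything except connection.url is copied unchanged from the config in the question.
# Update the existing sink so it talks to the elastic service, not localhost
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/elastic-sink-test/config \
  -d '{
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "TEST_A",
    "connection.url": "http://elastic:9200",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "SQ",
    "key.ignore": "false",
    "type.name": "TEST_A",
    "behavior.on.null.values": "delete"
  }'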
We have a MongoDB cluster with 3 shards; each shard is a replica set containing 3 nodes. The MongoDB version we use is 3.2.6. We have a big database of about 230 GB which contains about 5500 collections. We found that about 2300 collections are not balanced, while the other 3200 collections are evenly distributed across the 3 shards.
Below is the result of sh.status() (the whole result is too big, so I only post part of it):
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("57557345fa5a196a00b7c77a")
}
shards:
{ "_id" : "shard1", "host" : "shard1/10.25.8.151:27018,10.25.8.159:27018" }
{ "_id" : "shard2", "host" : "shard2/10.25.2.6:27018,10.25.8.178:27018" }
{ "_id" : "shard3", "host" : "shard3/10.25.2.19:27018,10.47.102.176:27018" }
active mongoses:
"3.2.6" : 1
balancer:
Currently enabled: yes
Currently running: yes
Balancer lock taken at Sat Sep 03 2016 09:58:58 GMT+0800 (CST) by iZ23vbzyrjiZ:27017:1467949335:-2109714153:Balancer
Collections with active migrations:
bdtt.normal_20131017 started at Sun Sep 18 2016 17:03:11 GMT+0800 (CST)
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
1490 : Failed with error 'aborted', from shard2 to shard3
1490 : Failed with error 'aborted', from shard2 to shard1
14 : Failed with error 'data transfer error', from shard2 to shard1
databases:
{ "_id" : "bdtt", "primary" : "shard2", "partitioned" : true }
bdtt.normal_20160908
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 142
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160909
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 36
shard2 42
shard3 46
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160910
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 34
shard2 32
shard3 32
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160911
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 30
shard2 32
shard3 32
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160912
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 126
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160913
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 118
too many chunks to print, use verbose if you want to force print
}
Collection "normal_20160913" is not balanced, I post the getShardDistribution() result of this collection below:
mongos> db.normal_20160913.getShardDistribution()
Shard shard2 at shard2/10.25.2.6:27018,10.25.8.178:27018
data : 4.77GiB docs : 203776 chunks : 118
estimated data per chunk : 41.43MiB
estimated docs per chunk : 1726
Totals
data : 4.77GiB docs : 203776 chunks : 118
Shard shard2 contains 100% data, 100% docs in cluster, avg obj size on shard : 24KiB
The balancer process is running, and the chunk size is the default (64 MB):
mongos> sh.isBalancerRunning()
true
mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "chunksize", "value" : NumberLong(64) }
{ "_id" : "balancer", "stopped" : false }
And I found a lot of moveChunk errors in the mongos log, which might be the reason why some of the collections are not well balanced. Here is the latest part of them:
2016-09-19T14:25:25.427+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.620+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.644+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.701+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.728+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.232+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.256+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.101+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.112+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:43:41.889+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
I tried to use the moveChunk command manually; it returns the same error:
mongos> sh.moveChunk("bdtt.normal_20160913", {_id:ObjectId("57d6d107edac9244b6048e65")}, "shard3")
{
"cause" : {
"ok" : 0,
"errmsg" : "Not starting chunk migration because another migration is already in progress",
"code" : 117
},
"code" : 117,
"ok" : 0,
"errmsg" : "move failed"
}
I am not sure if too many collections were created, overwhelming the migration process? About 60-80 new collections are created each day.
I need help here to answer the questions below; any hints would be great:
Why are some of the collections not balanced? Is it related to the large number of newly created collections?
Is there any command that can check the details of the migration jobs being processed? I get a lot of error logs showing that some migration job is running, but I cannot find which one is running.
Answering my own question:
We finally found the root cause. It is exactly the same issue as this one, "MongoDB balancer timeout with delayed replica", caused by an abnormal replica set config.
When this issue happened, our replica set configuration was as below:
shard1:PRIMARY> rs.conf()
{
"_id" : "shard1",
"version" : 3,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 0,
"host" : "10.25.8.151:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 1,
"host" : "10.25.8.159:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "10.25.2.6:37018",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 3,
"host" : "10.47.114.174:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : true,
"priority" : 0,
"tags" : {
},
"slaveDelay" : NumberLong(86400),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5755464f789c6cd79746ad62")
}
}
There are 4 nodes inside the replica set: one primary, one secondary, one arbiter and one 24-hour delayed secondary. With 4 voting members, 3 nodes are required for a majority; since the arbiter holds no data, the balancer has to wait for the delayed secondary to satisfy the write concern (to make sure the receiving shard has received the chunk).
There are several ways to solve the problem. We just removed the arbiter, and the balancer works fine now.
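For reference, removing the arbiter is a one-liner run against the shard's primary. This is only a sketch using the hosts from the rs.conf() output above; adjust the addresses to your own primary and arbiter.
# Remove the arbiter from the shard1 replica set configuration
mongo --host 10.25.8.151:27018 --eval 'rs.remove("10.25.2.6:37018")'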
I'm going to speculate, but my guess is that your collections are very imbalanced and are currently being balanced by chunk migration (it might take a long time). Hence your manual chunk migration is queued but not executed right away.
Here are a few points that might clarify a bit more:
One chunk at a time: MongoDB chunk migration uses a queue mechanism and only one chunk is migrated at a time.
Balancer lock: The balancer lock information might give you some more idea of what is being migrated. You should also be able to see log entries for chunk migration in your mongos log files. (A query sketch follows below.)
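For example, from a mongos you can look at the config database directly. A sketch: the mongos address is a placeholder, and on 3.2 the currently held balancer/migration locks live in config.locks while recent migrations are recorded in config.changelog.
# Show currently held locks and the five most recent moveChunk events
mongo --host <mongos-host>:27017 --eval '
  db.getSiblingDB("config").locks.find({ state: { $gt: 0 } }).forEach(printjson);
  db.getSiblingDB("config").changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(5).forEach(printjson);
'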
One option you have is to do some pre-splitting on your collections. The pre-splitting process essentially configures an empty collection to start balanced and avoids it becoming imbalanced in the first place, because once collections get imbalanced the chunk migration process might not be your friend. (See the sketch below.)
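Since the shard key is hashed, a new empty collection can be created already split across the shards with numInitialChunks. A sketch: the collection name and chunk count are made up for illustration, and numInitialChunks only applies to hashed shard keys on empty collections.
# Shard the next daily collection up front, pre-split into 6 chunks spread over the shards
mongo --host <mongos-host>:27017 --eval '
  db.getSiblingDB("admin").runCommand({
    shardCollection: "bdtt.normal_20160920",
    key: { _id: "hashed" },
    numInitialChunks: 6
  })
'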
Also, you might want to revisit your shard keys. You are probably doing something wrong with your shard keys that's causing a lot of imbalance.
Plus, your data size doesn't seem large enough to me to warrant a sharded configuration. Remember, never go for a sharded configuration unless you are forced to by your data size/working set size, because sharding is not free (you are probably already feeling the pain).
I'm trying to test automatic failover using the Mongoid 4.0.2 gem with MongoDB 2.4.3.
To simulate this I'm using this test code:
require 'mongoid'
class TestClass
  include Mongoid::Document
  store_in collection: "test", database: "test"
  field :uuid, type: String
end
Mongoid.load!("config/mongoid.yml", :test)
batch = (1..100).map { |x| TestClass.new({ uuid: x }) }
batch.each_with_index { |x, i|
  begin
    x.save
    sleep(5.seconds)
    puts "Saved #{i} records" if i % 10 == 0
  rescue Exception => e
    puts e.message
  end
}
In between saves, I jumped onto my MongoDB cluster and ran rs.stepDown() on the primary node. Unfortunately, this results in the following errors in my test app:
See https://github.com/mongodb/mongo/blob/master/docs/errors.md
for details about this error.
Moped::Errors::OperationFailure
The operation: #<Moped::Protocol::Command
@length=68
@request_id=192
@response_to=0
@op_code=2004
@flags=[]
@full_collection_name="test.$cmd"
@skip=0
@limit=-1
@selector={:getlasterror=>1, :w=>1}
@fields=nil>
failed with error 10058: "not master"
My Mongoid configuration looks like this:
test:
  sessions:
    default:
      database: test_db
      hosts:
        - 192.168.1.10:27017
        - 192.168.1.11:27017
      options:
        max_retries: 10
        retry_interval: 1
Any idea what I'm doing wrong here? I thought the Mongoid driver would automatically detect changes in the cluster and automatically retry the request after it updates the cluster state on the client / Ruby side?