Oak RDB Document Node Store write is slow - jackrabbit

We are using the Apache Jackrabbit Oak RDB Document node store, and the RDB type is PostgreSQL.
Environment Details
Apache jackrabbit oak 1.42.0
Java 11
Issue: Loading is extremely slow (about 10x slower) compared with the Segment node store (TAR).
We have a huge number of documents (800 GB) to migrate and have to use the RDB Document node store only.
Repo initialization code snippet for reference
RDBDocumentNodeStoreBuilder rdbDocumentNodeStoreBuilder = RDBDocumentNodeStoreBuilder
        .newRDBDocumentNodeStoreBuilder()
        .setRDBConnection(dataSource);
rdbDocumentNodeStoreBuilder.setBlobStore(createBlobStore());
DocumentNodeStore store = rdbDocumentNodeStoreBuilder.build();
Repository repository = new Jcr(new Oak(store).with(createSecurityProvider()).withAtomicCounter())
        .createRepository();
In thread dumps, most of the threads are stuck in socket read or socket write calls.
Postgres is also running on the same VM as the migration process.
For a single document check-in, around 90 rows are inserted into the NODES table.
How do we improve the loading speed? Any recommendations?
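For illustration, a batched import loop over such a repository using the plain JCR API (javax.jcr) might look like the sketch below. The content structure, property names, and the documentsToMigrate source are hypothetical, and batching several writes per session.save() is a common first step for bulk loads rather than a guaranteed fix:
Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
try {
    Node parent = session.getRootNode().addNode("import", "nt:unstructured");
    int batchSize = 100;   // tune: larger batches mean fewer commits and fewer round trips to Postgres
    int count = 0;
    for (SourceDocument doc : documentsToMigrate) {   // placeholder for the source data
        Node n = parent.addNode(doc.getName(), "nt:unstructured");
        n.setProperty("content", doc.getContent());
        if (++count % batchSize == 0) {
            session.save();   // one commit (and one set of NODES inserts) per batch
        }
    }
    session.save();           // flush the final partial batch
} finally {
    session.logout();
}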

Related

Solr AutoScaling - Add replicas on new nodes

Using Solr version 7.3.1
Starting with 3 nodes:
I have created a collection like this:
wget "localhost:8983/solr/admin/collections?action=CREATE&autoAddReplicas=true&collection.configName=my_col_config&maxShardsPerNode=1&name=my_col&numShards=1&replicationFactor=3&router.name=compositeId&wt=json" -O /dev/null
In this way I have a replica on each node.
GOAL:
Each shard should add a replica to new nodes joining the cluster.
When a node is shut down, its replicas should just go away.
Only one replica for each shard on each node.
I know that it should be possible with the new AutoScaling API, but I am having a hard time finding the right syntax. The API is very new and all I can find is the documentation. It's not bad, but I am missing some more examples.
This is how it looks today. There are many small shards, each with a replication factor that matches the number of nodes. Right now there are 3 nodes.
This video was uploaded yesterday (2018-06-13), and around 30 minutes into the video there is an example of the Solr HttpTriggerListener, which can be used to call any kind of service, for example an AWS Lambda, to add new nodes.
The short answer is that your goals are not achievable today (as of Solr 7.4).
The NodeAddedTrigger only moves replicas from other nodes to the new node in an attempt to balance the cluster. It does not support adding new replicas. I have opened SOLR-12715 to add this feature.
Similarly, the NodeLostTrigger adds new replicas on other nodes to replace the ones on the lost node. It, too, has no support for merely deleting replicas from the cluster state. I have opened SOLR-12716 to address that issue. I hope to release both enhancements in Solr 7.5.
As for the third goal:
Only one replica for each shard on each node.
To achieve this, a policy rule like the one given in the "Limit Replica Placement" example should suffice. However, looking at the screenshot you've posted, you actually mean a (collection, shard) pair, which is unsupported today. You'd need a policy rule like the following (it does not work because collection:#EACH is not supported):
{"replica": "<2", "collection": "#EACH", "shard": "#EACH", "node": "#ANY"}
I have opened SOLR-12717 to add this feature.
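For the plain per-shard version of that goal, the rule from the "Limit Replica Placement" example can be set as a cluster policy through the autoscaling API, roughly along these lines (endpoint and command shape as I recall them from the 7.x reference guide, so double-check there):
curl -X POST "http://localhost:8983/api/cluster/autoscaling" \
  -H 'Content-Type: application/json' \
  -d '{"set-cluster-policy": [{"replica": "<2", "shard": "#EACH", "node": "#ANY"}]}'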
Thank you for these excellent use-cases. I recommend asking questions such as these on the solr-user mailing list, because not a lot of Solr developers frequent Stack Overflow. I could only find this question because it was posted on the docker-solr project.

Flink add Task/JobManagers to cluster

Regarding adding new Task/JobManagers to an existing running cluster the procedure can be found here (https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html#adding-jobmanagertaskmanager-instances-to-a-cluster).
However, if we shut down the cluster and start it again, the information about the added hosts will be lost.
Is it safe practice, while adding the new host to the cluster, to also update and save the "masters" and "slaves" configuration files on all nodes in parallel?
Yes, it is absolutely safe. The information from the masters and slaves files is read only by the startup scripts.
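For illustration (the hostnames here are hypothetical), the two files simply list the hosts, as described in the linked cluster setup guide; the masters file additionally carries the JobManager's web UI port:
conf/masters (one JobManager per line, as host:webui-port):
jobmanager-1:8081
conf/slaves (one TaskManager host per line):
taskmanager-1
taskmanager-2
taskmanager-3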

How to scale and distribute the SOLR CLOUD nodes

I have initially set up SOLR CLOUD with two Solr nodes as shown below.
I have to add a new Solr node, i.e., with an additional shard and the same number of replicas as the existing SOLR CLUSTER nodes.
I have already gone through the SOLR scaling and distributing https://cwiki.apache.org/confluence/display/solr/Introduction+to+Scaling+and+Distribution
But the above link contains information about scaling only for SOLR standalone mode. That's the sad part.
I have started the SOLR CLUSTER nodes using the following command
./bin/solr start -c -s server/solr -p 8983 -z [zkip's] -noprompt
Kindly share the command for creating the new shard when adding a new node.
Thanks in advance.
I am sharing this answer from my own knowledge.
Adding a new SOLR CLOUD / SOLR CLUSTER node means getting a copy of all the SHARDs onto the new box (through replication of all SHARDs).
SHARD: The actual data is split equally across the number of SHARDs we create (while creating the collection).
So while adding the new SOLR CLOUD node, make sure that all the SHARDs are available on the new node (RECOMMENDED), or as required.
Naming Standards of SOLR CORE in SOLR CLOUD MODE/ CLUSTER MODE
Syntax:
<COLLECTION_NAME>_shard<SHARD_NUMBER>_replica<REPLICA_NUMBER>
Example
CORE NAME : enter_2_shard1_replica1
COLLECTION_NAME : enter_2
SHARD_NUMBER : 1
REPLICA_NUMBER : 1
STEPS FOR ADDING THE NEW SOLR CLOUD/CLUSTER NODE
Create a core with the same collection name as used on the existing SOLR CLOUD nodes.
Notes while creating a new core on the new node
Example :
enter_2_shard1_replica1
enter_2_shard1_replica2
From the above example, the maximum replica number of the corresponding shard is 2 (enter_2_shard1_replica2).
So on the new node, while creating the core, give the replica number as 3 ("enter_2_shard1_replica3") so that SOLR will treat it as the third replica of the corresponding SHARD.
Note: replica numbers should increase incrementally by 1.
Give time to replicate the data from the existing node to the new node.
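As a rough sketch (the host name new-solr-node is hypothetical; collection, shard and core names follow the example above), the core on the new node could be created through the CoreAdmin API, for example:
wget "http://new-solr-node:8983/solr/admin/cores?action=CREATE&name=enter_2_shard1_replica3&collection=enter_2&shard=shard1&wt=json" -O /dev/null
On newer Solr versions, the Collections API ADDREPLICA action achieves the same thing and lets Solr pick the core name itself.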

Spring Roo - Database Reverse Engineer freezes

We are new to Spring Roo but very familiar with RAD on PHP using Yii & Active Record.
I was able to run roo> database reverse engineer --schema to create models off an Oracle database for a proof of concept I am working on. The command line has been freezing since the 3rd attempt to update the schema. The difference between the first two attempts and the 3rd one is that we used the --includeTables option without knowing that it would overwrite the entire dbre.xml (instead of making an incremental change). We have cleaned the cache and even reinstalled Roo, but the issue persists. Even creating a new project did not help. I can see the following in the spring-roo logs:
// Spring Roo 1.3.2.RELEASE [rev 8387857] log opened at 2016-04-13 19:39:41
database properties list
// [failed] database reverse engineer --schema pfadmin --package ~.domain
Any idea or help is welcomed.
Found the solution after half a day of investigation. Spring Roo performs an ANALYZE TABLE while reverse engineering the models. If your database is very large, then computing statistics will take a very, very long time :D
My advice: export the database as DDL only (no data), create an empty development database, and run Spring Roo against it to get your models.
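For example, with Oracle Data Pump (the schema name is taken from the log above; credentials, connect strings and file names are placeholders), a metadata-only export/import copies the structure without any data:
expdp system/password@PRODDB schemas=PFADMIN content=METADATA_ONLY dumpfile=pfadmin_ddl.dmp logfile=pfadmin_exp.log
impdp system/password@DEVDB  schemas=PFADMIN content=METADATA_ONLY dumpfile=pfadmin_ddl.dmp logfile=pfadmin_imp.log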

Converting a DSE Search node to a DSE Spark node

I saw from the FAQ that a DSE node can be reprovisioned from RT mode to Hadoop mode. Is something similar supported with DSE Search and DSE Spark? I have an existing 6-node DSE Search cluster. I want to test DSE Spark but I have very limited time left for development so if possible, I'd like to skip the bootstrap process by simply restarting my cluster as an Analytics DC instead of adding new nodes in a separate DC.
UPDATE:
I tried to find an answer on my own. These are the closest that I found:
http://www.datastax.com/wp-content/uploads/2012/03/WP-DataStax-WhatsNewDSE2.pdf
http://www.datastax.com/doc-source/pdf/dse20.pdf
These documents are for a very old release of DSE. Both documents say that only RT and Analytics nodes can be re-provisioned. The second document even explicitly says that a Solr node cannot be re-provisioned. Unfortunately, there is no mention of re-provisioning in more recent documentation.
Can anybody confirm whether this is still true with DSE 4.5.1? (preferably with a link to a reference)
I also saw this forum thread, which explains why the section about re-provisioning was removed from recent documentation. However, in my case, I plan to re-provision all of my Search nodes as Analytics nodes (in contrast to re-provisioning only a subset), and the re-provisioning would only be temporary.
Yes, you can do that. Just start it using 'dse cassandra -k'.
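As a rough outline per node (assuming a tarball install; package installs instead set SPARK_ENABLED=1 in /etc/default/dse and restart the dse service):
bin/dse cassandra-stop   # stop the node
bin/dse cassandra -k     # start it again in Spark (Analytics) mode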
