The issue I have is locking on both indexes.
To explain: when constructing the EmbeddedSolrServer I have to pass the CoreContainer and the core name, so I have constructed two separate instances of the EmbeddedSolrServer, one for each core. Now I essentially do this (example code):
serverInstanceOne.add(document);
serverInstanceTwo.add(document); // This fails to obtain a lock
If serverInstanceOne is purely targeting core1, why does it create a lock in the index of core2?
Is there a way to prevent this? Or a way of forcing the server to drop the lock without shutting down each time?
I have tried to find an explanation of this behaviour in the Solr documentation but am still at a loss. Essentially I am using multicore and have a Spring Batch job which uses the EmbeddedSolrServer to pump data into some indexes overnight.
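For reference, the construction looks roughly like this (paths and core names are illustrative):

    CoreContainer container = new CoreContainer("/path/to/solr/home");
    container.load();
    EmbeddedSolrServer serverInstanceOne = new EmbeddedSolrServer(container, "core1");
    EmbeddedSolrServer serverInstanceTwo = new EmbeddedSolrServer(container, "core2");

Both instances share the same CoreContainer; only the core name differs.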
I am working on a Spring Boot application in a multi-threaded environment where I am using Hibernate with javax.persistence.EntityManager to access the database. I have separate HikariPools for read and write queries.
Here, multiple threads doing read operations against the database (all the read queries) share a single read connection, since I have autowired the EntityManager rather than using @PersistenceContext. Similarly, multiple threads write to the database through the writeEntityManager, where a single connection is shared by all of the threads.
I am facing an intermittent issue with AbstractLockUpgradeEventListener.upgradeLock and cannot find the exact root cause.
A few assumptions:
DB utilization touches 100% (which might contribute to the issue).
A lock is applied before executing any write query, and threads starve if one thread holds it for too long.
Can anyone suggest anything here with respect to design or implementation strategy, or what the actual root cause could be? This only happens once in a while.
The Hibernate EntityManager is not thread-safe; you must not use it from multiple threads.
In fact, neither the EntityManager nor the objects it loads may be used from multiple threads.
https://discourse.hibernate.org/t/hibernate-and-multithreading/289
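A minimal sketch of the safer pattern, assuming an injected EntityManagerFactory (which, unlike EntityManager, is thread-safe):

    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;

    // One short-lived EntityManager per unit of work, never shared across threads.
    void writeEntity(EntityManagerFactory emf, Object entity) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            em.persist(entity);               // work scoped to this thread only
            em.getTransaction().commit();
        } catch (RuntimeException e) {
            if (em.getTransaction().isActive()) {
                em.getTransaction().rollback();
            }
            throw e;
        } finally {
            em.close();                       // dispose; never reuse across threads
        }
    }

In Spring, injecting via @PersistenceContext instead of autowiring a single instance achieves the same effect, since it hands each transaction a thread-bound proxy.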
Can we increase Apache Solr performance for importing data from MySQL with the DataImportHandler?
Currently I am using:
a 4-core processor
16 GB RAM
50 GB HDD
1.2 million MySQL records
Right now a full import of the data takes 20 minutes.
Usually the best way is to drop DIH (which is single-threaded and runs on a single node, so it won't scale easily).
By writing a small, custom indexer in a suitable language (or even by using the bundled post tool), you can run multiple instances of your indexer, index to different nodes (allowing your content to be processed in parallel) and keep multiple threads open to both your backend database and to Solr.
It's important not to use explicit commits when indexing from multiple processes or threads, since committing often will kill performance. Use commitWithin instead, telling Solr to automagically issue a commit after x seconds have passed. If you have full control over when all processes/threads have finished, you can issue the commit yourself at the end of the indexing run (unless you want documents to become visible while indexing; in that case use commitWithin).
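If you go the custom indexer route, a hedged sketch with SolrJ might look like this (the URL, field names and tuning values are illustrative); ConcurrentUpdateSolrClient buffers documents and sends them from several background threads:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient
            .Builder("http://localhost:8983/solr/mycore")
            .withQueueSize(10000)      // documents buffered before callers block
            .withThreadCount(4)        // parallel connections to Solr
            .build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "42");
    doc.addField("title", "example row from MySQL");
    solr.add(doc, 10_000);             // commitWithin: commit within 10 seconds

    solr.blockUntilFinished();         // drain the queue before shutting down
    solr.close();

Running several such indexers against different rows of the source table lets the database reads, the processing and the Solr updates all happen in parallel.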
I have a server which has a Solr Environment hosted on it. I want to run a weekly update of the data that our Solr database contains.
I have a couple of solutions, but I was wondering whether one is possible and, if it is, which one would be better:
My first solution is to have two servers with a Solr environment on both; when one is updating, you simply switch the URL used to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do: is there a way to switch the data source that a Solr environment looks at without restarting it or interrupting any current searches?
If anyone has any ideas it would be much appreciated.
Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
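A minimal sketch of that cycle with SolrJ (the core name and the fetchFreshDocuments() helper are hypothetical):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
    solr.deleteByQuery("*:*");                    // staged; old data still served
    for (SolrInputDocument doc : fetchFreshDocuments()) {
        solr.add(doc);                            // also staged until the commit
    }
    solr.commit();                                // new state becomes visible here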
Another option is to use the core admin API to switch cores as you mentioned: build the new index in a separate core and then swap it into place, similar to copying data into other cores (leaving the mergeindexes command aside).
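A hedged sketch of such a swap via SolrJ's core admin API (the core names are illustrative): build the new index in a "staging" core, then atomically exchange it with "live":

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;
    import org.apache.solr.common.params.CoreAdminParams;

    try (HttpSolrClient admin = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
        CoreAdminRequest swap = new CoreAdminRequest();
        swap.setAction(CoreAdminParams.CoreAdminAction.SWAP);
        swap.setCoreName("live");           // core currently serving queries
        swap.setOtherCoreName("staging");   // core holding the fresh index
        swap.process(admin);
    }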
If you're also talking about updating or upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained, and then do it the other way around. Point your clients to an HTTP load balancer and take the server being maintained out of the list of servers serving requests while it's down. This will also make you resilient to single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.
I'm trying to integrate a key-value database with Spark and have some questions.
I'm a Spark beginner, have read a lot and run some samples but nothing too
complex.
Scenario:
I'm using a small hdfs cluster to store incoming messages in a database.
The cluster has 5 nodes, and the data is split into 5 partitions. Each
partition is stored in a separate database file. Each node can therefore process
its own partition of the data.
The Problem:
The interface to the database software is based on JNI; the database itself is
implemented in C. For technical reasons, the database software can maintain
only one active connection at a time: only one JVM process can be connected
to the database.
Because of this limitation, reading from and writing to the database must go
through the same JVM process.
(Background info: the database is embedded into the process. It's file based,
and only one process can open it at a time. I could let it run in a separate
process, but that would be slower because of the IPC overhead. My application
will perform many full table scans. Additional writes will be batched and are
not time-critical.)
The Solution:
I have a few ideas for how to solve this, but I don't know if they work
well with Spark.
Maybe it's possible to magically configure Spark to only have one instance of my
proprietary InputFormat per node.
When my InputFormat is used for the first time, it starts a separate thread
which creates the database connection. This thread then continues
as a daemon and lives as long as the JVM lives. This will only work
if there's just one JVM per node. If Spark starts multiple JVMs on the
same node, then each would start its own database thread, which would not
work.
Move my database connection to a separate JVM process per node, and have my
InputFormat use IPC to connect to this process. As I said, I'd like to avoid this.
Or maybe you have another, better idea?
My favourite solution would be #1 (sketched below), followed closely by #2.
Thanks for any comment and answer!
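To make #1 concrete, here is a rough sketch (all names hypothetical) of the per-JVM holder I have in mind:

    // One holder per executor JVM; every task in this JVM shares the single
    // native connection. NativeDb stands in for my JNI wrapper.
    public final class DbConnectionHolder {
        private static volatile NativeDb instance;

        private DbConnectionHolder() {}

        public static NativeDb get(String dbFile) {
            if (instance == null) {
                synchronized (DbConnectionHolder.class) {
                    if (instance == null) {
                        instance = NativeDb.open(dbFile);   // single JNI connection
                        Runtime.getRuntime()
                               .addShutdownHook(new Thread(() -> instance.close()));
                    }
                }
            }
            return instance;
        }
    }

As noted above, this only guarantees one connection per JVM, so it still relies on Spark running a single executor JVM per node.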
I believe the best option here is to connect to your DB from the driver, not from the executors. This part of the system would be a bottleneck anyway.
Have you thought of queueing (buffering) the data, then using Spark Streaming to dequeue it and write with your output format?
If the data from your DB fits into the RAM of your Spark driver, you can load it there as a collection and then parallelize it into an RDD: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#parallelized-collections
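A minimal sketch of that approach with the Java API (jsc is a JavaSparkContext; MyRow and loadAllRows() are assumed):

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    List<MyRow> rows = loadAllRows();            // single-process read on the driver
    JavaRDD<MyRow> rdd = jsc.parallelize(rows);  // distributed across the executors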
While using Solr (we are currently on 3.5), how do we set up the masters for failover?
Let's say in my setup I have two masters and two slaves. The application commits all writes to one active master, and both slaves get their updates from this active master. There is also a repeater which serves the same purpose as the master.
Now my question is: if the master goes down for some reason, how can I promote the repeater to master without any manual intervention? How can the slaves start getting their updates from the repeater instead of the broken master? Is there a recommended way to do this? Are there any other recommended master/slave setups to ensure high availability of the Solr systems?
At this time, your best option is probably to investigate the SolrCloud functionality present in the current Solr 4.0 alpha, which at the time of this writing is due for its final release within a few months. The goal of SolrCloud is to handle data distribution and master election, using the ZooKeeper distributed database to maintain consensus within the cluster about which nodes are serving in which roles.
There are other more traditional ways to set up failover for Solr 3's replicated master-slave architecture, but I personally wouldn't want to make that investment with Solr 4.0 so near to release.
Edit: See Linux-HA for one such traditional approach. Personally, I would create a purpose-built daemon that reconfigures your cores and load balancer, using ZooKeeper for presence detection and distributed locks.
If outsourcing is an option, you might consider a hosted service such as my own humble Websolr. We provide this kind of distribution and hot failover by default, so our customers don't have to worry as much about the mechanics of how it's implemented.
I agree with Nick. The way replication works in Solr 3.x is not always handy, especially for master failover. If you are going to consider Solr 4, you might want to have a look at Elasticsearch too, which solves this kind of problem in a really brilliant way!
It uses push replication instead of the pull mechanism used by Solr. That means the document is literally reindexed on all nodes. It might sound strange, but it reduces network load (due to segment merging, for example). Furthermore, one node is elected as master, and if it crashes another node will automatically replace it, becoming the new master.