Solr DIH in a cluster environment

I have a SolrCloud environment configured, up and running, with no issues at all. But now I need to run a delta-import in a loop: every time an import process finishes, another one should start.
Considerations:
Same DIH configuration on all nodes.
The 3 Solr nodes are running behind a load balancer (the command can be executed on any of the nodes).
I don't want to start the importer on a second node if it's already running on another node.
I would like to run the DIH as soon as the last execution finishes, right away.
If one node goes down during an import, I would like to be able to say "this is taking too long, let's just start another import process". (If there is a way to identify the node where the process was running when it went down, so I can check it and save that information to find out the reasons, that would be great.)
There are so many events going on in the database every minute that I really need all of them (DB records) in Solr, so the documents stay up to date.
Options and thoughts
I'm thinking of using JBoss EAP 5.1 to run an external app with the TimerService. Since I have a cluster there, I can ensure this app will run forever, asking for status and restarting the DIH process in a loop.
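For reference, such an external watchdog loop could look roughly like this minimal Python sketch. It assumes the standard DIH endpoints (`command=status` / `command=delta-import`); the node URL, core name, and timeout values are placeholders:

```python
import json
import time
from urllib.request import urlopen

SOLR_CORE = "http://solr-node2:8983/solr/mycore"  # hypothetical node/core
POLL_SECS = 10        # how often to ask for status
TIMEOUT_SECS = 3600   # assume the node is stuck or down after this long

def import_busy(status):
    """DIH's command=status response reports 'busy' while an import runs."""
    return status.get("status") == "busy"

def fetch_status():
    with urlopen(SOLR_CORE + "/dataimport?command=status&wt=json") as resp:
        return json.load(resp)

def run_forever():
    while True:
        # kick off the next delta-import as soon as the last one finished
        urlopen(SOLR_CORE + "/dataimport?command=delta-import&wt=json")
        started = time.time()
        while import_busy(fetch_status()):
            if time.time() - started > TIMEOUT_SECS:
                break  # taking too long: log the node and retrigger
            time.sleep(POLL_SECS)
```

Running this watchdog on one machine (or inside a clustered TimerService, as above) also sidesteps the load balancer problem, because the status and trigger requests always go to one known node.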
I was also taking a look at, and testing, the DIH EventListener:
<dataConfig>
  <document onImportEnd="com.me.MyNotificationService">
    ....
  </document>
</dataConfig>
com.me.MyNotificationService can let me know when the process has finished, but I still don't know how to connect it to the "run Solr import" app, since that listener runs in a library outside my JBoss AS container (and again, if the Solr node goes down, I lose the notification as well).
If there were a way to ensure this loop won't be broken, and all of this were managed by the Solr cluster (including situations like a node going down in the middle of an import), I would forget about that external "run Solr import" app, but I really don't think that's possible.
It would be really useful to be able to tell the Solr cluster to execute this import process on a specific node (say, node 2), and then either be notified when it finished or have a way to ask for status on that specific node 2, even if I'm sending the request to node 1 because of the load balancer.
Any recommendation and thoughts will be more than welcome.
Thanks.

Related

Start solr indexing from CI job

We use Solr 6.4.1 and have implemented several cores for searching. One of the cores contains several entities. All the steps for refreshing the index are started manually from the UI, including entering the database credentials.
My question is: can I reindex a Solr core with several entities from a remote console? I need to create a CI job for this.
And the second question: where can I specify custom parameters with the database credentials for all cores on the server?
If the application has some sort of command, you could just trigger it directly from the CI pipeline. If that's not the case and the indexing/update code is tightly coupled to the UI, then you could use the DataImportHandler: you configure in Solr (as described in the documentation) the credentials, the queries that Solr needs to execute, etc., and just trigger the import handler from the CI pipeline, something like:
http://<host>:<port>/solr/<collection_name>/dataimport?command=delta-import
This will start a delta-import; for more commands, check the Data Import Handler Commands section of the reference guide.
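From a CI job, the trigger is just an HTTP GET. A minimal Python sketch; the host, port, and core name here are placeholder values:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def dih_url(host, port, core, command, **params):
    """Build a DataImportHandler command URL for a given core."""
    query = urlencode({"command": command, "wt": "json", **params})
    return f"http://{host}:{port}/solr/{core}/dataimport?{query}"

# Trigger a delta-import with a commit at the end (placeholder host/core)
url = dih_url("solr.example.com", 8983, "products", "delta-import", commit="true")
# urlopen(url)  # uncomment inside the CI job to actually fire the request
```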

Solr nodes' replication is getting stuck

We have standalone Solr servers set up as master and slave, plus a full-indexing job that runs nightly. Generally, when the job executes successfully, everything is fine. But in recent days we noticed that the indexer node had a different document count from the search node, so the expected products were not available in our production system. We had to restart the nodes and start replication manually, and then the problem went away. We need to prevent this problem from occurring again. What do you suggest we check, or where should we look? I think the essential error for this issue is: "SEVERE: No files to download for index generation"
Regards

Controlling what cores get loaded when solr 5 starts/restarts

I set up my Solr instance to run the way I wanted. Then the service was restarted, all my setup was removed, and 4 gettingstarted cores were loaded instead.
Can someone explain why this happened and what I can do to prevent it from happening again? I would like the cores that I built to be persistent.
Thanks for your help
Edit: Looking over:
https://cwiki.apache.org/confluence/display/solr/Moving+to+the+New+solr.xml+Format
I have solr.xml set up exactly like the example, and the core.properties files are set up properly. I still don't see how Solr is supposed to know to load the core I created.
Edit 2: I found documentation stating that any core.properties files under the Solr home folder will be used:
https://cwiki.apache.org/confluence/display/solr/Solr+Cores+and+solr.xml
Assuming Solr 5, it sounds like you ran bin/solr restart and got the wrong collections. The reason would be that the restart command needs the same parameters as the start command, most importantly your Solr home path.
Solr home is what you provided with the -s parameter when you last started Solr (e.g. bin/solr restart -p 8983 -s /path/to/solr/home). If you instead started from an example, this guide on Solr home locations should help.

Disappearing cores in Solr

I am new to Solr.
I have created two cores from the admin page, let's call them "books" and "libraries", and imported some data into them. Everything works without a hitch until I restart the server. When I do so, one of these cores disappears, and the logging screen in the admin page contains:
SEVERE CoreContainer null:java.lang.NoClassDefFoundError: net/arnx/jsonic/JSONException
SEVERE SolrCore REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore#454055ac (papers) has a reference count of 1
I was testing my query in the admin interface; when I refreshed it, the "libraries" core was gone, even though I had been querying it normally just a minute earlier. The contents of solr.xml are intact. Even if I restart Tomcat, it remains gone.
Additionally, I was trying to build a query similar to this: "Find books matching 'war peace' in libraries in Atlanta or New York". So, given the cores "books" and "libraries", I would issue the following query against "books" (which might be wrong; if it is, please correct me):
(title:(war peace) blurb:(war peace))
AND _query_:"{!join
fromIndex=libraries from=libraryid to=libraryid
v='city:(new york) city:(atlanta)'}"
When I do so, the query fails and the "libraries" core disappears, with the above symptoms. If I re-add it, I can continue working (as long as I don't restart the server or issue another join query).
I am using Solr 4.0; if anyone has a clue what is happening, I would be very grateful. I could not find anything about the meaning of the error message, so if anyone could suggest where to look for that, or how to go about debugging this, it would be really great. I can't even find where the log file itself is located...
I would avoid the Debian package, which may be misconfigured and quirky. It also contains (a very early build of?) Solr 4.0, which itself may have lingering issues, being the first release in a new major version. The package maintainer may not have incorporated the latest and safest Solr release into the package.
A better way is to download Solr 4.1 yourself and set it up yourself with Tomcat or another servlet container.
In case you are looking to install Solr 4.0 and configure it, you can follow the installation procedure from here.
Update the Solr config so the cores are persistent: in your solr.xml, change <solr> or <solr persistent="false"> to <solr persistent="true">.

solr healthcheck for >0 documents

The default /admin/ping handler that Solr provides for load balancer health checks integrates well with the Amazon ELB health checks.
However, since we're using master-slave replication, when we provision a new node Solr starts up and replication begins, but in the meantime /admin/ping returns success before the index has replicated across from the master and there are any documents.
We'd like nodes to be brought live only once they have completed the first replication and have documents. I don't see any way of doing this with the /admin/ping PingRequestHandler: it always returns success if the search succeeds, even with zero results.
Nor is there any way of matching (or not matching) expected text in the response with the ELB health check configuration.
How can we achieve this?
To expand on the nature of the problem here: the PingRequestHandler will always return success unless...
Its query results in an exception being thrown.
It is configured to use a healthcheck file, and that file is not found.
Thus my suggestion is that you configure the PingRequestHandler to use a healthcheck file. You can then use a cron job on your Solr system whose job is to check for the existence of documents and create (or remove) the healthcheck file accordingly. If the healthcheck file is not present, the PingRequestHandler will return an HTTP 503, which should be sufficient for ELB.
The rough algorithm that I'd use...
Every minute, query http://localhost:8983/solr/select?q=*:*
If numDocs > 0 then touch /path/to/solr-enabled
Else rm /path/to/solr-enabled (optional, depending on your strictness)
The healthcheck file can be configured in the <admin> block of solrconfig.xml, and you can use an absolute path, or a filename relative to the directory from which you started Solr.
<admin>
  <defaultQuery>solr</defaultQuery>
  <pingQuery>q=*:*</pingQuery>
  <healthcheck type="file">/path/to/solr-enabled</healthcheck>
</admin>
Let me know how that works out! I'm tempted to implement something similar for read slaves at Websolr.
I ran into an interesting solution here: https://jobs.zalando.com/tech/blog/zookeeper-less-solr-architecture-aws/?gh_src=4n3gxh1
It's basically a servlet that you can add to the Solr webapp to check all of the cores and make sure they have documents.
I'm toying with a more sophisticated solution but haven't tested it/made much progress yet: https://gist.github.com/er1c/e261939629d2a279a6d74231ce2969cf
What I like about this approach (in theory) is the ability to check the replication status/success for multiple cores. If anyone finds an actual implementation of this approach please let me know!
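The same per-core check could also be done from outside Solr. A sketch assuming the CoreAdmin STATUS API with the JSON response writer, which reports numDocs per core; the base URL is a placeholder:

```python
import json
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"  # placeholder base URL

def cores_with_no_docs(status_json):
    """Given a CoreAdmin STATUS response, list cores reporting zero documents."""
    empty = []
    for name, info in status_json.get("status", {}).items():
        if info.get("index", {}).get("numDocs", 0) == 0:
            empty.append(name)
    return empty

def healthy():
    """Healthy only when at least one core exists and every core has documents."""
    with urlopen(SOLR + "/admin/cores?action=STATUS&wt=json") as resp:
        status = json.load(resp)
    return bool(status.get("status")) and not cores_with_no_docs(status)
```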
