Solr Collection Creation and Custom Collection Libraries

I've run into a situation where we sometimes have to completely wipe an index and then re-index a collection. This process of course takes a lot of time, and I don't want to allow for any (or at least any extended) downtime in Prod. So I am researching a way in Solr to create a new Collection that is a copy of an old Collection, but without the data. I can then re-index this new Collection with little or no service degradation, and finally point the alias that our clients are using at the new Collection, so that they start using the new Collection without even knowing it.
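For illustration, this is roughly the flow I'm picturing with the Collections API (the collection, config, and alias names below are made up, and the shard/replica counts would have to match our setup):

```python
# Sketch of the planned flow; names and counts are placeholders.
import requests

SOLR = "http://localhost:8983/solr"

# 1. Create an empty collection from the same configset already in ZooKeeper.
requests.get(f"{SOLR}/admin/collections", params={
    "action": "CREATE",
    "name": "products_v2",
    "numShards": 2,
    "replicationFactor": 2,
    "collection.configName": "products_conf",
})

# 2. ... re-index everything into products_v2 here ...

# 3. Re-point the alias the clients actually query; CREATEALIAS on an existing
#    alias moves it, so the switch is effectively invisible to clients.
requests.get(f"{SOLR}/admin/collections", params={
    "action": "CREATEALIAS",
    "name": "products",            # the alias used by clients
    "collections": "products_v2",
})
```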
I'm currently running 4.2, but I'm wondering if I should upgrade to 4.7 to support this better. It seems like 4.2 has most of the same Collections API support.
One of the first snags I'm hitting is that the Collection I am copying has a lib folder in it with custom libraries. If possible, I want to push these out to the solrhome/lib folder so that they only get loaded once. My problem with this is that if I have different versions of, say, a custom data importer, then I will run into classloader issues.
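If it helps frame the classloader concern: per-collection jars can be pinned from the collection's own solrconfig.xml along these lines (the paths and jar name are purely illustrative), which is roughly what I'd have to keep per collection if the shared solrhome/lib idea doesn't pan out:

```xml
<!-- Illustrative fragment of a collection's solrconfig.xml: load one specific
     version of the custom jar from a collection-local directory, so another
     collection can ship a different version without classloader clashes. -->
<config>
  <lib dir="./custom-lib" regex="custom-dataimporter-1\.2\.jar" />
</config>
```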
Has anyone successfully implemented this kind of scenario and could provide some insight into the pitfalls and successes that you had and what worked for you?
More Details...
I have many different collections that are part of this SolrCloud. If possible, I don't want to affect any of the other collections while making changes to the new copied collection.

I'm also in a similar kind of situation, where I might have to modify the Solr schema and need to re-index the whole data set. But I don't have much downtime available in production, so we came up with a solution like this:
Let's say I have SolrCloud1 (the existing one), with collection1 (it has its own structure). I have my application running on a different machine, and there is a load-balancer between SolrCloud1 and the application.
Now, create a separate SolrCloud (say, SolrCloud2) with collection1, maintaining the same structure as before. Do the re-indexing in this SolrCloud2. When it's done, make the new SolrCloud available under the load-balancer, and once SolrCloud2 is up, shut down SolrCloud1.
Thus you'll re-index the data without any production downtime, and users won't notice anything.
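Before adding SolrCloud2 to the load-balancer, a quick sanity check can be as simple as comparing document counts on both clouds (just a sketch; the URLs and collection name are placeholders):

```python
# Sketch: verify the re-indexed cloud before switching the load-balancer over.
import requests

OLD = "http://solrcloud1:8983/solr/collection1"
NEW = "http://solrcloud2:8983/solr/collection1"

def doc_count(base_url):
    # rows=0 returns just the response header and numFound, not the documents
    resp = requests.get(f"{base_url}/select",
                        params={"q": "*:*", "rows": 0, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

old_count, new_count = doc_count(OLD), doc_count(NEW)
print(f"old={old_count} new={new_count}")
if new_count >= old_count:
    print("SolrCloud2 looks complete; safe to put it behind the load-balancer.")
```

Hope this will help.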

Related

Does Vespa support dynamic fields?

I am looking for an option like dynamic fields (Solr) in Vespa. I need to add new fields to the existing schema without redeploying the whole application.
The relevant part of the Vespa documentation seems to be http://docs.vespa.ai/documentation/search-definitions.html#modify-search-definitions,
where it is mentioned that we can add a new field and run vespa-deploy prepare and vespa-deploy activate. But isn't that a resubmission of the whole application? What is the option with the least overhead to implement this?
http://docs.vespa.ai/documentation/search-definitions.html#modify-search-definitions is the right document, yes.
Altering the search definition is safe, as the deploy prepare step will output any actions needed (e.g. none, restart, or re-feed). Most changes do not require a restart or re-feed, as Vespa is built to be production friendly, and adding fields is easy as it requires no actions at all.
Note that there is no support for default values for fields.
As the Vespa configuration is declarative, the full application package is submitted, but the config servers will calculate changes and deploy the delta to the nodes. This makes it easy to keep the application package config in a code repo like git - what you see in the repo is what is deployed.
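For example, adding a field is just an edit to the search definition inside the application package, followed by the two deploy commands already mentioned; a minimal sketch (the file, search, and field names are illustrative):

```
# searchdefinitions/music.sd (illustrative)
search music {
    document music {
        field title type string {
            indexing: summary | index
        }
        # newly added field: prepare/activate applies it with no restart or re-feed
        field genre type string {
            indexing: summary | index
        }
    }
}
```

Then run vespa-deploy prepare <application-dir> followed by vespa-deploy activate; prepare reports any actions the change would require.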
It depends on what you mean by "dynamic".
If the fields number in the hundreds, are controlled by the application owners, and change daily, then changing the schema and redeploying works perfectly fine: fields can be updated individually (even when indexed), they incur no overhead if not used, and deploying an application with new fields is cheap and does not require any restarts or anything like that.
If you need tens of thousands of fields, or the fields are added and removed by users, then Vespa does not have a solution for you out of the box. You'd need to mangle the fields into the same index by adding e.g. a "myfield_" prefix to each token. This is what engines that support this do internally to make it efficient.
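A rough sketch of that mangling idea on the feed side (purely illustrative application code, not a Vespa API):

```python
# Illustrative: fold arbitrary user-defined fields into one indexed field by
# prefixing each token with its field name, e.g. "color_red", "size_xl".
def mangle_user_fields(user_fields: dict) -> str:
    tokens = []
    for field_name, value in user_fields.items():
        for token in str(value).lower().split():
            tokens.append(f"{field_name}_{token}")
    return " ".join(tokens)

doc = {"user_tokens": mangle_user_fields({"color": "dark red", "size": "xl"})}
# doc == {"user_tokens": "color_dark color_red size_xl"}
# Queries against these user fields apply the same prefixing to their terms.
```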

Handling constantly updating fields in Solr and preventing constantly reloading the index?

I'm running Solr 6.3.0 with ZooKeeper/SolrCloud.
I have a 'product' core with a few integer/boolean fields that are getting updated constantly; a specific example would be inventory count. I currently have these working with external file fields. This setup accomplishes my goal of not having Solr constantly re-indexing; however, I don't like having to generate these files. I would rather have the values come directly from my database.
It seems like atomic/partial updates are what I am looking for, but I'm having trouble understanding the efficiency ramifications of these updates vs. using external file fields... or perhaps there is a better approach altogether.
I would appreciate any insight you could provide.
ExternalFileField is useful when you want to change the values of a field in many/all of the docs at the same time. If that is not the case, I would just use a normal field and update as needed (partial updates are possible). With the external file you don't need to reindex, but you do need to reload the file...
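For example, a partial (atomic) update that touches only the inventory field looks roughly like this (the core name, field names, and id are placeholders; atomic updates need the updateLog enabled and, in general, the other fields stored so the document can be rewritten):

```python
# Sketch: atomic update of one document's inventory count via the JSON update handler.
import requests

url = "http://localhost:8983/solr/product/update"
payload = [{
    "id": "SKU-123",
    "inventory_count": {"set": 42},   # "set" replaces the value; "inc" adds to it
}]
requests.post(url, json=payload, params={"commit": "true"}).raise_for_status()
```

Under the hood Solr still re-indexes that one document (it reads the stored fields, applies the change, and writes the doc again), so this is much cheaper than re-indexing everything but not free; that's the balance mentioned below.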
Regarding efficiency: for sure, the less frequently you update your docs the better; you have to reach a balance.

Easy GUI way to add data to a Solr index

I'm working on the front end of an app that uses Solr for data storage. Currently I have an empty index, but it'd (understandably) be a lot easier for me if some dummy data was returned so I could make sure that it's output correctly on the front end.
If I was working with an RDBMS (let's say Postgres) I'd open up a GUI (e.g. pgAdmin) and type data manually into a few rows to achieve this goal. I have access to the Solr web interface, but I can't see any obvious call to action saying INSERT YOUR DATA HERE. The closest thing I can find to an answer on the web is this SO thread, but it's still not quite the droid (read: easy GUI-based solution) I'm looking for.
So, my question is: is there a way to quickly and easily insert some data, equivalent to the RDBMS method mentioned above?
Make sure you have defined a schema in schema.xml.
Solr does indeed have a (limited) HTML GUI, which on a local installation is probably found at localhost:8983/solr (the default). If you can get to the base admin page, there is a small combobox on the left where you can select a core/collection. If you click on that, a list of options emerges, and you can pick 'Documents' to get a GUI similar to what I think you expect from Postgres/an RDBMS/whatnot.
http://localhost:8983/solr/#/collection1/documents is the URL on a default Solr installation that I have. This should work on a default setup (replace collection1 with your collection name and localhost:8983 with wherever your Solr is hosted, plus the port).
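Once you're on that Documents screen, you can leave the Document Type set to JSON and paste in a document like the one below (the field names are just examples; they have to exist in, or be allowed by, your schema):

```json
{
  "id": "test-001",
  "title": "Dummy product for front-end testing",
  "price": 9.99,
  "in_stock": true
}
```

Hit Submit Document, and after a commit a q=*:* query should return it.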

How to implement Solr into Sitecore

I have to implement a Solr index in Sitecore and I would like to know the best approach.
I looked at following approaches:
Capture the publish:end event (or other events) and then push the item to the Solr index.
Implement a custom database crawler and get all changes from the history table, then push the data to Solr using a custom index.
The second approach sounds like the way to go (in my opinion). In this case, do I need to create a new search index, or a search manager?
If anyone's done this before, can you point me in the right direction? Also, could you post some links to articles about Sitecore-Solr implementations?
UPDATE
OK, after reading the Sitecore documentation, this is what I came up with:
Create a custom SolrConfiguration class where you can set properties like the Solr service URL and add indexes and their definitions (custom Solr indexes).
Create a SolrIndex and add it (in the config file) to your SolrConfiguration. When instantiated, the SolrIndex should subscribe to the AddEntry event of the Sitecore History Manager and communicate with the Solr crawlers.
Create a custom processor and hook it into the Sitecore initialize pipeline. The processor should initialize the SolrConfiguration (from step 1).
Since everything in your config file will be built using reflection, you can get an instance of your configuration based on your config file.
How does that sound? Any comments would be appreciated.
We've done this on a few sites and tend to have a "published" Solr index and an "unpublished" index.
We interrupt:
OnItemSaving
Event to push things into the unpublished index (you may not need this; it depends on whether you want things in preview mode).
OnPublishItemProcessed
We process additions and updates to the published index here. I'm not sure what we do about deletions here without digging right into the code, but we certainly deal with deletions in OnItemDelete (mentioned below).
OnItemDelete
We interrupt here to remove things from the published and non-published indexes (I think we remove from the published index here because Sitecore makes you publish the parent node in order to publish deletions out to the web database).
I hope that helps, I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things kind of like how it was doing it.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map them to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us, except I will note that because we are indexing everything at once, it tends to introduce a slight bit of lag between publishing and the site reflecting any changes made to the index. One oversight we made in the beginning, but will be working to fix soon, is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution much really, but I can definitely see where it would impact others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index so we do the indexing as a publish:end event.
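Wiring that up is just ordinary event configuration in an include/patch file; the handler type and method below are placeholders for whichever class kicks off the SolrNet indexing:

```xml
<!-- Illustrative include file; the handler type/method names are placeholders. -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <events>
      <event name="publish:end">
        <handler type="MySite.Search.SolrPublishIndexer, MySite.Search" method="OnPublishEnd" />
      </event>
    </events>
  </sitecore>
</configuration>
```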
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.

How can I set "allow_dynamic_fields" on a per collection basis in Mongoid?

Mongoid only seems to have an app-wide Mongoid.allow_dynamic_fields setting, but I want every collection to decide whether it wants dynamic fields or not. How can I do that?
This is not possible in Mongoid 3.x yet, even though there is already a pull request for it: https://github.com/mongoid/mongoid/pull/2563. This will be released in Mongoid 4.0. However, if you really need the feature now, you could check out those two commits in a local folder or in a fork and use them.
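For what it's worth, in Mongoid 4 this ends up as a per-model opt-in module rather than the global flag, roughly like this (a sketch based on my reading of that pull request; the models are made up):

```ruby
# Sketch: per-collection control over dynamic fields in Mongoid 4.
class Event
  include Mongoid::Document
  include Mongoid::Attributes::Dynamic   # this model/collection accepts arbitrary fields
end

class User
  include Mongoid::Document               # no Dynamic module: only declared fields allowed
end
```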
