We have some sites which use Solr as an internal search. This is done with the extension ext:solr from DKD. The extension ships an install script which provides cores for multiple languages.
This works well on most systems.
Meanwhile we have some bigger sites, and because of a few special requirements we run into problems:
We have sites which import data on a regular basis from outside of TYPO3. To keep the Solr index up to date we need to rebuild the complete index (at night). But as the site gets bigger, the reindex takes longer and longer, and if an error occurs the index is broken the next day.
You could say: no problem, just refresh all records. But that would leave information in the index for records which have been deleted in the meantime (there is no 'delete' information in the import; a deleted record is simply no longer present in it). So either a complete deletion of all records before the import, or special marking and explicit deletion afterwards, is necessary.
Either way, the reindex takes very long, can't be triggered at arbitrary times, and an error leaves the index incomplete.
In theory there is the option to work with two indices: one which is built up anew while the other one is used for search requests. This way you always have a complete index, even if it might not be fully up to date. After the new index is built you can swap the indices and rebuild the older one.
That swap would need to be triggered from inside TYPO3, but I have not found anything about such a configuration.
Another theoretical option might be a master-slave (replication) setup, but as far as I can tell:
when the master index is reset for the rebuild, that reset would be replicated to the slave, which then loses all the information it is supposed to serve until the rebuild is complete.
(I think the problem is independent of a specific TYPO3 or Solr version, so no version tag.)
Do you know about the read and write connection concept introduced in EXT:solr 9? https://docs.typo3.org/p/apache-solr-for-typo3/solr/11.0/en-us/Releases/solr-release-9-0.html#support-to-differ-between-read-and-write-connections
Isn't that something for your case?
The only thing you need is to set it up properly in your deployment.
Once the fresh index is finalized and not broken, you simply switch the read connection to point at the core you have just been writing to.
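For the two-core idea you can also swap cores through Solr's CoreAdmin API once a rebuild has finished, so search requests always hit a complete index. A minimal SolrJ sketch, assuming a standalone (non-cloud) Solr on localhost and two hypothetical cores named core_live and core_rebuild:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        // Point the client at the Solr root, not at a specific core.
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
        // After the rebuild into core_rebuild has completed successfully,
        // swap it with the core that serves live search traffic.
        CoreAdminRequest swap = new CoreAdminRequest();
        swap.setAction(CoreAdminParams.CoreAdminAction.SWAP);
        swap.setCoreName("core_rebuild");
        swap.setOtherCoreName("core_live");
        swap.process(client);
        client.close();
    }
}

Whether you trigger this from a TYPO3 scheduler task or from the import cron job is up to your deployment; the swap itself happens on the Solr side.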
My current indexing takes about 1:30 hr. That is too long to wait since I want NRT (near real-time) updates, so I have enabled autoCommit and autoSoftCommit as below:
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:600000}</maxTime> <!-- 10 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> <!-- 5 minutes -->
</autoSoftCommit>
The problem is that every time a full import starts, it clears the old documents, which defeats the purpose of enabling autoSoftCommit. I don't know what I am missing here.
My expectation is to keep the documents from the last index run, add new documents to the existing index, and replace duplicate documents.
If I disable autoSoftCommit, the documents are not deleted.
The indexing is started by a cron job. The URL is
http://localhost:8983/solr/mycore/dataimport?clean=true&commit=true&command=full-import
Appreciate any help. Thanks
When you commit, you end up clearing the index if you've issued a delete. Don't issue commits if you don't want deletes to be visible. You can't have it both ways: you can't do a full index that clears the index first and then expect the documents to appear progressively afterwards without committing the delete as well. A full import is just that: it cleans out the index, then imports any documents that currently exist, and then commits. If you want to commit earlier, that means the cleaning of the index will be visible too.
In general, when talking about near real-time we're talking about submitting documents through the regular /update endpoints and having those changes be visible within a second or two. When you're using the dataimporthandler with a full-import, the whole import will have to run before any changes become visible.
If you still want to use the dataimporthandler (which has been removed from Solr core in 9 and is now a community project), you'll have to configure delta imports instead of using the full import support. That way you only get changes for those documents that have been added, removed or changed - and you don't have to issue the delete (the clean part of your URL) - since any deletions should be handled by your delta queries. This requires that your database has a way to track when a given row changed, so that you can only import and process those rows that actually have changed (if you want it to be efficient, at least).
If you have no way of tracking this in your database layer, you're stuck with doing it the way you're currently doing it; in that case, disable the soft commit and let the changes become visible after the import has finished.
A hybrid approach is also possible: do delta updates and manual submissions to /update during the day, then run a full index at night to make sure that Solr and your database match. This will depend on your requirements for how quickly you need to handle any differences between Solr and your database (i.e. if you miss submitting a delete, is it critical if it doesn't get removed until late at night?).
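If your tables do have a change timestamp, the nightly cron call could trigger a delta import instead of a cleaning full import. A minimal SolrJ sketch of such a trigger, assuming the core name mycore from the question, the default /dataimport handler path, and deltaQuery definitions already present in the DIH data config:

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerDeltaImport {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "delta-import"); // only rows matched by the deltaQuery
        params.set("clean", "false");          // do not wipe the existing documents
        params.set("commit", "true");          // commit once the delta run has finished
        GenericSolrRequest request = new GenericSolrRequest(SolrRequest.METHOD.GET, "/dataimport", params);
        request.process(client);
        client.close();
    }
}

This is equivalent to calling the /dataimport URL with command=delta-import&clean=false&commit=true from the cron job itself.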
I am using the Java post tool for Solr to upload and index a directory of documents. There are several thousand documents. Solr only does a commit at the very end of the process, and sometimes things stop before it completes, so I lose all the work.
Does anyone have a technique to fetch the name of each doc and call post on it individually, so you get a commit for each document rather than one large commit of all the docs at the end?
From the help page for the post tool:
Other options:
..
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
This should allow you to use -params "commitWithin=1000" to make sure each document shows up within one second of being added to the index.
Committing after each document is overkill for performance; in any case, it's quite strange that you have to resubmit everything from the start if something goes wrong. I seriously suggest changing the indexing strategy you're using instead of investigating a different way to commit.
Given that, if you have no option other than changing the commit configuration, I suggest configuring autoCommit in your Solr collection/index or using the commitWithin parameter, as suggested by #MatsLindh. Just check whether the tool you're using lets you add this parameter.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when pushing documents), or in an update RequestHandler.
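If you index from your own client code rather than through the post tool, commitWithin can also be set per request in SolrJ. A minimal sketch, assuming a hypothetical core named mycore with id and title fields:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddWithCommitWithin {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "example document");
        // Ask Solr to make this document searchable within 1000 ms
        // instead of issuing an explicit (and expensive) commit per document.
        client.add(doc, 1000);
        client.close();
    }
}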
database=admin command={:ismaster=>1} (4002.9714ms)
I see this happening every millisecond; maybe I'm exaggerating, it happens every second or so.
The problem is that when you have replica set members located in different physical locations, you get slow responses on that query, which runs every second against all replica set members (by the way, the members are not defined as hosts in the mongoid config, but it figures them out from the replica set configuration and starts this crazy task).
I'm wondering if I can turn that option off?
I believe this is happening to keep Mongoid up to date on which node it should write to. I believe this is an old version, Mongoid 3.0.x; the newer versions should not have that problem.
I'm also not sure whether newer versions of Mongoid handle primary changes correctly.
I must say it's always better to use the language driver, the MongoDB Ruby driver in this case!
I have to implement a Solr index in Sitecore and I would like to know what the best approach is.
I looked at the following approaches:
Capture the publish:end event (or other events) and then push the item to the Solr index
Implement a custom database crawler and get all changes from the history table, then push the data to Solr using a custom index
The second approach sounds like the way to go (in my opinion). In this case, do I need to create a new search index, or a search manager?
If anyone's done it before, can you point me in the right direction? It would also help if you could post some links to articles about Sitecore-Solr implementations.
UPDATE
OK, after reading the Sitecore documentation, this is what I came up with:
Create a custom SolrConfiguration class where you can set properties like solrserviceurl, and add the indexes and their definitions (custom Solr indexes)
Create a SolrIndex and add it (in the config file) to your SolrConfiguration. When instantiated, the SolrIndex should subscribe to the AddEntry event of the Sitecore History Manager and communicate with the Solr crawlers
Create a custom processor and hook it into the Sitecore initialize pipeline. The processor should initialize the SolrConfiguration (from step 1)
Since everything in your config file will be built using reflection, you can get an instance of your configuration based on your config file.
How does that sound? Any comments would be appreciated.
We've done this on a few sites and tend to have a "published" Solr index and an "unpublished" index.
We intercept:
OnItemSaving
We use this event to push things into the unpublished index (you may not need this; it depends on whether you want items available in preview mode).
OnPublishItemProcessed
We process additions and updates to the published index here. I'm not sure what we do about deletions at this point without digging right into the code, but we certainly deal with deletions in OnItemDelete (mentioned below).
OnItemDelete
We intercept here to remove things from both the published and unpublished indexes (I think we remove from the published index here because Sitecore makes you publish the parent node in order to push deletions out to the web database).
I hope that helps, I'd post the code if I could (but I'd be scowled at).
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things much the same way it does.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object that has properties that correspond to our Solr schema. For actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items (starting with the site root) recursively and map them to the appropriate type based on its template. Then we go through an indexing process for that item (some items need to index multiple documents to Solr in our implementation).
This approach is working very well for us, except I will note that because we are indexing everything at once, it tends to introduce a slight lag between publishing and the site reflecting the changes made to the index. One oversight we made in the beginning, but will be working to fix soon, is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution that much really, but I can definitely see where it would affect others, so keep that in mind.
We didn't particularly want to get into the deletion of items from the index so we do the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.
Using a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR:
Initialize the server:
StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);
For each item:
  create a SolrInputDocument for it
  server.add(document);
When all items have been added:
  server.commit();
  server.optimize();
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. No errors in any logs, and I have substantial try/catch blocks with logging around all SOLRJ exceptions on the client side.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be overriding the log settings that would otherwise let you see an error message when a document fails to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, you will need to make sure that the settings are not such that it is hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point... you should be seeing any errors during indexing in your server log file.
Troubleshoot why Documents are failing to be indexed
In order to investigate this, it is helpful to run the following verification step any time after the indexing is complete (a sketch of this loop follows the list):
Initialize new log log_fromsolr
Initialize new log log_notfound
For each Item…
-->Search SOLR for the item. If SOLR has the document, log each of the item's fields on a single line in log_fromsolr. This should include the uniqueKey for your document if you have one.
-->If the document cannot be found in SOLR for this item, write a line to log_notfound with all the fields from the object from the database, again supplying the uniqueKey as the first field.
Once the verification step has completed, log_notfound contains a list of all documents that failed to be added to the index.
You can use the log_fromsolr log to compare the document fields for an item that made it into the index with one that did not.
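A minimal SolrJ sketch of that verification loop, assuming a hypothetical Item class for the database-side objects, an id uniqueKey field, and a core at http://localhost:8983/solr/mycore (all of these names are placeholders for your own setup):

import java.io.PrintWriter;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class VerifyIndex {

    // Placeholder for your own domain object.
    public interface Item {
        String getId();
    }

    public static void verify(List<Item> items) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        try (PrintWriter fromSolr = new PrintWriter("log_fromsolr.txt");
             PrintWriter notFound = new PrintWriter("log_notfound.txt")) {
            for (Item item : items) {
                // Look the item up by its uniqueKey.
                SolrQuery query = new SolrQuery("id:" + item.getId());
                QueryResponse response = client.query(query);
                if (response.getResults().getNumFound() > 0) {
                    // Found in the index: log the stored fields on a single line.
                    fromSolr.println(item.getId() + " " + response.getResults().get(0));
                } else {
                    // Missing from the index: log the database-side object.
                    notFound.println(item.getId() + " " + item);
                }
            }
        }
        client.close();
    }
}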
Verify it is not an intermittent issue
Sometimes it is not the same items that fail to be added to the index each time you run the indexing.
If you find objects in the log_notfound log, you will want to back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences in these files (Note: some differences are to be expected if new objects are being created in the database in between the first and second re-indexing).
If your problem is intermittent, it most certainly points at the application code with respect to your SOLR transactions not being committed correctly.
The same documents consistently come up missing each time it indexes
At this point we have to compare the documents that can be found in the SOLR index with the documents that are not getting into the Lucene index. Usually a field-by-field comparison of the objects will start turning up suspicious values that may be causing issues when adding the document to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this worked, you will want to start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.
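One way to test the elimination idea programmatically is to strip the suspected fields from each SolrInputDocument just before adding it. A minimal sketch, assuming hypothetical field names such as bigTextField and binaryBlob:

import org.apache.solr.common.SolrInputDocument;

public class StripSuspiciousFields {

    // Hypothetical field names suspected of breaking indexing.
    private static final String[] SUSPICIOUS_FIELDS = {"bigTextField", "binaryBlob"};

    public static SolrInputDocument strip(SolrInputDocument doc) {
        for (String field : SUSPICIOUS_FIELDS) {
            doc.removeField(field); // a no-op if the field is absent
        }
        return doc;
    }
}

If the stripped documents all make it into the index, re-introduce the fields one by one until the failing one is found.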