My current indexing takes about 1:30 hr. That is too long to wait, and since I want NRT updates I have enabled autoCommit and autoSoftCommit as below:
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:600000}</maxTime> <!-- 10 minutes -->
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> <!-- 5 minutes -->
</autoSoftCommit>
The problem is that every time the full import starts, it clears the old documents, which defeats the purpose of enabling autoSoftCommit. I don't know what I am missing here.
My expectation is to keep the documents from the last index, add new documents to the existing ones, and replace duplicate documents.
If I disable autoSoftCommit, it does not delete the documents.
The indexing is started by a cronjob. The URL is
http://localhost:8983/solr/mycore/dataimport?clean=true&commit=true&command=full-import
Appreciate any help. Thanks
When you commit, you end up clearing the index if you've issued a delete. Don't issue commits if you don't want deletes to become visible. You can't have it both ways: you can't do a full import that clears the index first and then expect the documents to appear progressively afterwards, without the delete being committed as well. A full import is just that: it cleans out the index, then imports any documents that currently exist, and then commits. If you commit earlier, the cleaning of the index becomes visible earlier as well.
In general, when we talk about near realtime we mean submitting documents through the regular /update endpoints and having those changes become visible within a second or two. When you're using the DataImportHandler with a full-import, the whole import has to finish before any changes become visible.
If you still want to use the DataImportHandler (which has been removed from Solr core in 9.0 and is now a community project), you'll have to configure delta imports instead of using the full-import support. That way you only fetch the documents that have been added, removed or changed, and you don't have to issue the delete (the clean part of your URL), since any deletions should be handled by your delta queries. This requires that your database has a way to track when a given row changed, so that you only import and process the rows that actually have changed (if you want it to be efficient, at least).
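As a rough illustration, a delta-import setup in DIH's data-config.xml could look like the sketch below. The table and column names (item, last_modified, deleted) and the JDBC settings are placeholders for whatever your schema actually uses; ${dataimporter.last_index_time} and ${dataimporter.delta.id} are the standard DIH variables:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" user="user" password="secret"/>
  <document>
    <entity name="item" pk="id"
            query="SELECT id, title, body FROM item"
            deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, body FROM item WHERE id = '${dataimporter.delta.id}'"
            deletedPkQuery="SELECT id FROM item WHERE deleted = 1 AND last_modified &gt; '${dataimporter.last_index_time}'"/>
  </document>
</dataConfig>

Your cronjob would then call .../dataimport?command=delta-import&commit=true instead of a full-import with clean=true.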
If you have no way of tracking this in your database layer, you're stuck doing it the way you're doing it now; in that case, disable the soft commit and let the changes become visible after the import has finished.
A hybrid approach is also possible: do delta updates and manual submissions to /update during the day, then run a full index at night to make sure that Solr and your database match. Whether this works for you depends on how quickly you need to handle any differences between Solr and your database (i.e. if you miss submitting a delete, is it critical if it doesn't get removed until late at night?).
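For the manual submission part, a minimal sketch against the JSON update endpoint (core name and field values are just examples) would be:

# add or replace a document, visible within roughly a second
curl 'http://localhost:8983/solr/mycore/update?commitWithin=1000' -H 'Content-Type: application/json' -d '[{"id": "42", "title": "updated title"}]'

# delete a document by id
curl 'http://localhost:8983/solr/mycore/update?commitWithin=1000' -H 'Content-Type: application/json' -d '{"delete": {"id": "42"}}'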
Related
We have some sites which use Solr as an internal search. This is done with the extension ext:solr from DKD. Within the extension there is an install script which provides cores for multiple languages.
This is working well on most systems.
Meanwhile we have some bigger sites, and since they have some special requirements we run into problems:
We have sites which import data on a regular basis from outside of TYPO3. To get the Solr index up to date we need to rebuild the complete index (at night). But as the site gets bigger, the reindex takes longer and longer, and if an error occurs the index is broken the next day.
You could say: no problem, just refresh all records. But that would leave information in the index for records which have been deleted in the meantime (there is no 'delete' information in the import, except that a deleted record is no longer part of the import). So a complete delete of all records before the import (or special marking and explicit deletion afterwards) is necessary.
Anyway, the reindex takes very long and can't be triggered at just any time. And an error leaves the index incomplete.
In theory there is the option to work with two indices: one which is built up anew while the other one is used for search requests. This way you always have a complete index, even if it might not be fully up to date. After the new index is built, you can swap the indices and rebuild the older one.
That needs to be triggered from inside TYPO3, but I have not found anything about such a configuration.
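On the Solr side, the swap itself seems straightforward with the CoreAdmin API (the core names here are just examples); it is the TYPO3 integration I am missing:

http://localhost:8983/solr/admin/cores?action=SWAP&core=core_en_live&other=core_en_rebuild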
Another theoretical option might be a master-slave configuration, but as far as I can tell:
when the master index is reset for the rebuild, this reset would be replicated to the slave, which then loses all the information it should provide until the rebuild is complete.
(I think the problem is independent of a specific TYPO3 or solr version, so no version tag)
Do you know about the read and write connection concept introduced in EXT:Solr 9? https://docs.typo3.org/p/apache-solr-for-typo3/solr/11.0/en-us/Releases/solr-release-9-0.html#support-to-differ-between-read-and-write-connections
Isn't it something for your case?
The only thing you need to do is set it up properly in your deployment.
Once your fresh index is finalized and not broken, you just switch the read core to read from the core you previously wrote to.
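A rough sketch of the TypoScript side (host names and core path are placeholders, and the exact option names should be double-checked against the release notes linked above):

plugin.tx_solr.solr {
    read {
        scheme = https
        host = solr-read.example.org
        port = 8983
        path = /solr/core_en/
    }
    write {
        scheme = https
        host = solr-write.example.org
        port = 8983
        path = /solr/core_en/
    }
}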
I am using the Java post tool for Solr to upload and index a directory of documents. There are several thousand documents. Solr only does a commit at the very end of the process, and sometimes things stop before it completes, so I lose all the work.
Does anyone have a technique to fetch the name of each doc and call post on it, so you get a commit for each document rather than one large commit of all the docs at the end?
From the help page for the post tool:
Other options:
..
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
This should allow you to use -params "commitWithin=1000" to make sure each document shows up within one second of being added to the index.
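For example, with the bin/post wrapper (the collection name and path are placeholders):

bin/post -c mycollection -params "commitWithin=1000" /path/to/documents/

Every batch the tool sends then carries commitWithin=1000, so even if the run dies before the final commit, everything already sent will have been committed shortly after it was added.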
Committing after each document is overkill for performance. In any case, it's quite strange that you have to resubmit everything from the start if something goes wrong; I'd seriously suggest changing the indexing strategy you're using instead of investigating a different way to commit.
That said, if you have no option other than changing the commit configuration, I suggest configuring autoCommit in your Solr collection/index or using the commitWithin parameter, as suggested by #MatsLindh. Just check whether the tool you're using lets you add this parameter.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when pushing documents), or in an update RequestHandler.
I have a setup with Sitecore and Solr.
I'm looking to gather information (the different template IDs) during publishItem, and then, when the publish has ended, call Solr with the names which need to be reindexed.
I've managed to get all the template IDs both using a PublishItemProcessor and in a publish:itemProcessed event handler, where I store the template IDs in PublishContext.CustomData as a HashSet.
But how can I, once the publishing is done, get hold of the information I've gathered during publishing? I want to call Solr once, and only once, after everything is published, with the information gathered during the publishing.
Hope this makes sense, guys. Please help out.
You don't need a hack to reindex your indexes after publishing.
Sitecore has this functionality out of the box.
You use index update strategies to maintain indexes. You can configure each index with a unique set of index update strategies. You should not specify more than three update strategies per index for performance reasons.
Sitecore provides a varied set of index update strategies, and you can extend this set with more strategies.
All the strategies that are delivered with Sitecore are defined under the following node in the Sitecore.ContentSearch.Solr.Index.IndexName configuration files:
<configuration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration" />
<strategies hint="list:AddStrategy">
You need to use one of these default strategies (see the configuration snippet after this list):
RebuildAfterFullPublish
OnPublishEndAsync
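For example, wiring the publish-end strategy into an index definition looks roughly like this (the exact ref path can differ between Sitecore versions, so compare it against your own Sitecore.ContentSearch.Solr.Index.* file):

<strategies hint="list:AddStrategy">
  <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync" />
</strategies>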
You can find more information about search, indexing and crawling here:
https://doc.sitecore.net/sitecore_experience_platform/setting_up__maintaining/search_and_indexing
I'm a total solr noob so I'm probably missing important information here.
Solr version: 4.10.2
Platform: Mac OS X
I'm attempting to add about 5000 documents to an empty index. Documents have 4 fields:
id (string, indexed, stored)
title (solr.TextField, indexed, not stored)
keywords (solr.TextField, multi-valued, indexed, not stored)
content (solr.TextField, indexed, not stored)
I'm using update/json to insert the documents in batches of 100 in a tight loop (making a new HTTP request to the update/json endpoint for each batch). My process often hangs waiting for Solr to respond at some point during the run. The problem gets better if I add, e.g., a 100 ms delay between each request; if I delay a full second it goes away completely, but that is obviously unacceptably slow.
I have worked around it by adding very short timeouts to my HTTP requests (1 second) and implementing some retry logic. It works, but of course I get annoying delays all the time as it retries.
For instance, if I start with a fresh core and test it right now, these are my results for each run in turn:
hang on the 45th batch, solr admin shows 3,280 documents
hang on the 52nd batch, solr admin shows 3,788 documents
hang on the 14th batch, solr admin shows 3,788 documents
hang on the 17th batch, solr admin shows 3,788 documents
successfully completes all batches, solr admin shows 4,043 documents
The log in solr admin shows no output during any of these runs. At any point after a failed or successful run I can query the index and get back reasonable results considering the data that has been added.
The update/json request handler is the one that is "implicitly added" -- it is not specified in my solrconfig.xml.
I have tried switching my locking mechanism from native to simple with no change in behavior.
Any help you can offer would be greatly appreciated. I'm not sure where to start.
Additional info:
1: It seems to hang forever. By "hang" I mean Solr never responds to the HTTP request. If I cancel the request and send it again, it generally works fine right away. I have let it wait up to about 10 minutes for a response.
2: My solrconfig.xml has this:
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
</updateHandler>
You did not describe the actual 'hang'. Is it hanging for a period of time or forever? That makes quite a difference.
I am assuming your actual documents (content fields?) are quite large.
There might be a couple of things:
Garbage collection. If you allocated a lot of memory to Solr, the GC pause when it hits the limit could be quite long. There are Java flags to enable GC reporting during a test run (see the example after this list).
Index merging. Watch the data/index directory and see if the files start moving around.
Server logs. Look in the server logs, not just in the WebUI. The server logs have constant chatter about what's going on; the UI only shows the issues.
Commit settings. It's also worth checking what your commit and soft-commit settings are (in solrconfig.xml).
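For the garbage collection point, a sketch of enabling GC reporting (these are Java 8-style flags, set here via the SOLR_OPTS variable used by the Solr start scripts; adjust for your Java version and install layout):

SOLR_OPTS="$SOLR_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/solr/logs/solr_gc.log"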
Using a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR:
Initialize StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);
For each Item…
-->Create document
-->server.add(document)
When all finished,
server.commit();
server.optimize();
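In actual SolrJ code this is roughly the following (the URL, the queue size of 100, the 4 threads and the field values are placeholders; my real loop reads the items from the database):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Reindexer {
    public static void main(String[] args) throws Exception {
        // queue up to 100 docs per request, flushed by 4 background threads
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr/mycore", 100, 4);

        for (int i = 0; i < 500000; i++) {          // stands in for iterating over my real items
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));  // uniqueKey field
            doc.addField("title", "Item " + i);
            server.add(doc);                        // queued and sent asynchronously; failures surface via
                                                    // handleError(Throwable), which by default only logs them
        }

        server.blockUntilFinished();                // drain the queue before committing
        server.commit();
        server.optimize();
    }
}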
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. There are no errors in any logs, and I have substantial try/catch blocks with logging around all SOLRJ exceptions on the client side.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be overriding the log settings, preventing you from seeing an error message when a document fails to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, you will need to make sure that the settings are not such that it is hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point... you should be seeing any errors during indexing in your server log file.
Troubleshoot why Documents are failing to be indexed
In order to investigate this, it is helpful to run the following verification step any time after the indexing is complete:
Initialize new log log_fromsolr
Initialize new log log_notfound
For each Item…
-->Search SOLR for the item. If SOLR has the document, log each of the item's fields on a single line in log_fromsolr. This should include the uniqueKey for your document if you have one.
-->If the document cannot be found in SOLR for this item, write a line to log_notfound with all the fields of the object from the database, again putting the uniqueKey first.
Once the verification step has completed, the log_notfound file contains a list of all documents that failed to be added to the index.
You can use the log_fromsolr log to compare the document fields of an item that made it into the index with one that did not.
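A minimal SolrJ sketch of that verification loop (the URL, the id range and the log file names are placeholders; in a real run you would iterate over the database items and write out their fields, not just the id):

import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class IndexVerifier {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr/mycore");
        PrintWriter logFromSolr = new PrintWriter(new FileWriter("log_fromsolr"));
        PrintWriter logNotFound = new PrintWriter(new FileWriter("log_notfound"));

        for (int i = 0; i < 500000; i++) {                  // stands in for iterating over the database items
            String id = String.valueOf(i);
            QueryResponse response = server.query(new SolrQuery("id:" + id));   // look up by uniqueKey
            if (response.getResults().getNumFound() > 0) {
                logFromSolr.println(response.getResults().get(0));  // found: one line with the indexed fields
            } else {
                logNotFound.println(id);                            // missing: record it for later comparison
            }
        }
        logFromSolr.close();
        logNotFound.close();
    }
}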
Verify it is not an intermittent issue
Sometimes it might be the case that it is not the same Items failing to be added to the index each time you try to index.
If you find objects in the log_notfound log, you will want to back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences in these files (Note: some differences are to be expected if new objects are being created in the database in between the first and second re-indexing).
If your problem is intermittent, it almost certainly points at your application code not committing its SOLR transactions correctly.
The same documents consistently come up missing each time it indexes
At this point we have to compare the documents that are found in the SOLR index with the documents that are not getting into the Lucene index. Usually a field-by-field comparison of the objects will start turning up some suspicious values that may be causing issues when adding the document to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this works, start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.