I'm a total solr noob so I'm probably missing important information here.
Solr version: 4.10.2
Platform: Mac OS X
I'm attempting to add about 5000 documents to an empty index. Documents have 4 fields:
id (string, indexed, stored)
title (solr.TextField, indexed, not stored)
keywords (solr.TextField, multivalued, indexed, not stored)
content (solr.TextField, indexed, not stored)
I'm using update/json to insert the documents in batches of 100 in a tight loop, making a new HTTP request to the update/json endpoint for each batch.
My process often hangs waiting for Solr to respond at some point during the run. The problem gets better if I add, e.g., a 100ms delay between requests; with a full second of delay it goes away completely, but that is obviously unacceptably slow. I have worked around it by using very short timeouts for my HTTP requests (1 second) and implementing some retry logic (see the sketch after the run results below). It works, but of course I get annoying delays all the time as it retries.
For instance, if I start with a fresh core and test it right now, these are my results for each run in turn:
hang on the 45th batch, solr admin shows 3,280 documents
hang on the 52nd batch, solr admin shows 3,788 documents
hang on the 14th batch, solr admin shows 3,788 documents
hang on the 17th batch, solr admin shows 3,788 documents
successfully completes all batches, solr admin shows 4,043 documents
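Roughly, the loop looks like this (a simplified sketch; the host, core name, and document values here are placeholders, and the real code builds the documents from my data):

import requests

UPDATE_URL = "http://localhost:8983/solr/mycore/update/json"  # placeholder core name
BATCH_SIZE = 100

def post_batch(batch):
    # Short timeout plus retry is the workaround mentioned above: if Solr
    # hangs, give up after 1 second and resend the same batch.
    for attempt in range(5):
        try:
            resp = requests.post(UPDATE_URL, json=batch, timeout=1)
            resp.raise_for_status()
            return
        except requests.exceptions.RequestException:
            continue
    raise RuntimeError("batch kept failing after retries")

docs = [{"id": str(i), "title": "t", "keywords": ["k"], "content": "c"} for i in range(5000)]
for start in range(0, len(docs), BATCH_SIZE):
    post_batch(docs[start:start + BATCH_SIZE])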
The log in solr admin shows no output during any of these runs. At any point after a failed or successful run I can query the index and get back reasonable results considering the data that has been added.
The update/json request handler is the one that is "implicitly added" -- it is not specified in my solrconfig.xml.
I have tried switching my locking mechanism from native to simple with no change in behavior.
Any help you can offer would be greatly appreciated. I'm not sure where to start.
Additional info:
1: It seems to hang forever. By "hang" I mean Solr never responds to the HTTP request. If I cancel the request and send it again, it generally works fine right away. I have let it wait up to about 10 minutes for a response.
2: My solrconfig.xml has this:
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
</updateHandler>
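(With these settings a new searcher is only opened when an explicit commit, commitWithin, or soft commit is issued; one can be forced with a plain update request, roughly like the sketch below, where the core name is a placeholder.)

import requests

# Forces a hard commit that opens a new searcher, making everything
# indexed so far visible to queries.
requests.get("http://localhost:8983/solr/mycore/update",
             params={"commit": "true"}, timeout=30).raise_for_status()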
You did not describe the actual 'hang'. Is it hanging for a period of time or forever? That makes quite a difference.
I am assuming your actual documents (content fields?) are quite large.
There might be a couple of things:
Garbage collection. If you allocated a lot of memory to Solr, the GC pauses when it hits the limit could be quite long. There are Java flags to enable GC reporting during a test run (e.g. -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log).
Index merging. Watch the data/index directory and see if the files start moving around (see the sketch below).
Also look in the server logs, not just the web UI. The server logs have constant chatter about what's going on; the UI only shows the issues.
It's also worth checking what your commit and soft-commit settings are (in solrconfig.xml).
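To watch for that segment churn, something as crude as the sketch below is enough (the path is a placeholder for your core's data/index directory):

import os, time

INDEX_DIR = "/path/to/solr/mycore/data/index"  # placeholder path

previous = set()
while True:
    current = set(os.listdir(INDEX_DIR))
    added, removed = current - previous, previous - current
    if added or removed:
        # Heavy, continuous churn here while updates hang suggests segment merging.
        print("added: %s  removed: %s" % (sorted(added), sorted(removed)))
    previous = current
    time.sleep(1)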
My current indexing takes about 1:30 hr. That is too long to wait since I want NRT updates, so I have enabled autoCommit and autoSoftCommit as below:
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:600000}</maxTime> <!-- 10 minutes -->
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> <!-- 5 minutes -->
</autoSoftCommit>
The problem is that every time the full import starts, it clears the old documents, which defeats the purpose of enabling autoSoftCommit. I don't know what I am missing here.
My expectation is to keep the documents from the last index run, add new documents to the existing index, and replace duplicate documents.
If I disable autoSoftCommit then it does not delete the documents.
The indexing is started by a cron job. The URL is
http://localhost:8983/solr/mycore/dataimport?clean=true&commit=true&command=full-import
Appreciate any help. Thanks
When you commit, you end up clearing the index if you've issued a delete. Don't issue commits if you don't want deletes to be visible. You can't have it both ways - you can't do a full import that clears the index first and then expect the documents to appear progressively afterwards without committing the delete as well. A full import is just that - it cleans out the index, then imports any documents that currently exist, and then commits. If you want to commit earlier, that means that the cleaning of the index will be visible.
In general, when talking about near realtime we're talking about submitting documents through the regular /update endpoints and having those changes be visible within a second or two. When you're using the dataimporthandler with a full-import, the whole import will have to run before any changes become visible.
If you still want to use the dataimporthandler (which has been removed from Solr core in 9 and is now a community project), you'll have to configure delta imports instead of using the full import support. That way you only get changes for those documents that have been added, removed or changed - and you don't have to issue the delete (the clean part of your URL) - since any deletions should be handled by your delta queries. This requires that your database has a way to track when a given row changed, so that you can only import and process those rows that actually have changed (if you want it to be efficient, at least).
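As a rough illustration (same host and core name as in your URL), the cron job would then hit the handler with the delta-import command and without the clean parameter, along these lines:

import requests

# Equivalent to: http://localhost:8983/solr/mycore/dataimport?command=delta-import&commit=true
# Only the rows matched by the deltaQuery/deltaImportQuery in data-config.xml are processed.
resp = requests.get("http://localhost:8983/solr/mycore/dataimport",
                    params={"command": "delta-import", "commit": "true"},
                    timeout=60)
resp.raise_for_status()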
If you have no way of tracking this in your database layer, you're stuck with doing it the way you're currently doing - but in that case, disable the soft commit and let the changes be visible after the import has finished.
A hybrid approach is also possible: do delta updates and manual submissions to /update during the day, then run a full index at night to make sure that Solr and your database match. This will depend on your requirement for how quickly you need to handle any differences between Solr and your database (i.e. if you miss submitting a delete - is it critical if it doesn't get removed until late at night?)
The problem is: I tried to replace a core by creating a new one with a different name, swapping them, and then UNLOADing the old one, but it failed.
Now, even trying to clean everything manually (unloading the cores with the AdminPanel or via curl using deleteIndexDir=true&deleteInstanceDir=true, and deleting the physical directories of both cores), nothing works.
If I UNLOAD the cores using the AdminPanel, then I don't see the cores listed anymore. But the STATUS command still returns me this:
$ curl -XGET 'http://localhost:8983/solr/admin/cores?action=STATUS&core=mycore&wt=json'
{"responseHeader":{"status":0,"QTime":0},"initFailures":{},"status":{"mycore":{"name":"mycore","instanceDir":"/var/solr/data/mycore","dataDir":"data/","config":"solrconfig.xml","schema":"schema.xml","isLoaded":"false"}}}
But, if I try to UNLOAD the core via curl:
$ curl -XGET 'http://localhost:8983/solr/admin/cores?action=UNLOAD&deleteIndexDir=true&deleteInstanceDir=true&core=mycore&wt=json'
{"responseHeader":{"status":0,"QTime":0}}
and there is no effect. I still see the core listed in the AdminPanel, the STATUS returns exactly the same, and of course if I try to access the cores, errors start popping up telling me that solrconfig.xml doesn't exist. Of course, nothing exists.
I know that if I restart Solr everything will be fine. But I cannot restart Solr in production whenever it gets into this state (and it does, very often).
Some time ago I made a comment here but I didn't get any useful reply.
Now, the real problem is that in production there are other cores working, and restarting Solr takes about half an hour, which is not OK at all.
So, the question is how to clean unloaded cores properly WITHOUT restarting Solr. Before saying "no, it's not possible", please try to understand the business requirement. It MUST be possible somehow. If you know the reason why it's not possible, let's start thinking together about how it could be possible.
UPDATE
I'm adding here some errors I've found looking at the logs, I hope it helps:
Solr init error
Solr create error
Solr duplicate requestid error (my script tried twice using the same id)
Solr closing index writer error
Solr error opening new searcher
I've just noticed that the error opening searcher and the one creating the core are related, both have Caused by: java.nio.file.FileAlreadyExistsException: /var/solr/data/mycore/data/index/write.lock
I have a script that, using the Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities and BarModel with about 1200 entities. Each has 15 StringProperty fields.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
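A rough sketch of what this could look like with ndb (names and the page size are illustrative; the returned cursor would be persisted between executions by whatever wrapper drives the script):

from google.appengine.datastore.datastore_query import Cursor

def fetch_one_page(model, websafe_cursor=None, limit=500):
    # Fetch at most `limit` entities per invocation; feed the returned cursor
    # back in on the next run to continue where this call left off.
    cursor = Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    page, next_cursor, more = model.query().fetch_page(limit, start_cursor=cursor)
    items = [entity.to_dict() for entity in page]
    return items, (next_cursor.urlsafe() if next_cursor else None), more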
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
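For example, a minimal admin-only handler along these lines (the route and class names are made up):

import webapp2
from google.appengine.api import users

class AdminDownloadHandler(webapp2.RequestHandler):
    # Runs inside production, so the remote_api bug is not involved.
    def get(self):
        if not users.is_current_user_admin():
            self.abort(403)
        # ... run the FooModel/BarModel queries here and write out the result ...
        self.response.write('done')

app = webapp2.WSGIApplication([('/admin/download', AdminDownloadHandler)])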
I am working with SolrCloud now, but I am facing a problem which can cause the indexing process to hang.
My deployment is a single collection with 5 shards running on 5 machines. Every day we do a full index of about 50M docs using the dataimporthandler, and we trigger indexing on one of the 5 machines, using the distributed indexing of SolrCloud.
I have found that sometimes one of the 5 machines will die because of:
2013-01-08 10:43:35,879 ERROR core.SolrCore - java.io.FileNotFoundException: /home/admin/index/core_p_shard2/index/_31xu.fnm (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
at org.apache.lucene.codecs.lucene40.Lucene40FieldInfosReader.read(Lucene40FieldInfosReader.java:52)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:57)
at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:120)
at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:267)
at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:3010)
at org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:180)
at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:310)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:386)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1445)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:448)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:325)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:230)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
I have checked the index dir, and indeed it does not contain _31xu.fnm. I am wondering whether there is some concurrency bug in distributed indexing?
As far as I know, distributed indexing works like this: you can send docs to any shard, and each doc is forwarded to the correct shard according to a hash of its id (roughly the idea sketched below); the dataimporthandler forwards docs to the correct shard via the update handler, and finally docs are flushed to disk via DocumentsWriterPerThread. I am wondering whether too many update requests sent from the shard that triggered the indexing caused the problem. My guess is based on the fact that the machine which died has a lot of index segments, each of them very small.
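(For what it's worth, my mental model of the routing is only the toy sketch below; as I understand it, the real router maps a hash of the id into hash ranges assigned to each shard rather than using a simple modulo.)

import hashlib

NUM_SHARDS = 5

def route_to_shard(doc_id):
    # Toy illustration only: bucket a doc id by a hash of its value.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS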
I am not too familiar with Solr, so maybe my guess makes no sense at all. Does anyone have some idea? Thanks.
Using a StreamingUpdateSolrServer, I used the following algorithm to re-index my huge dataset into SOLR:
Initialize StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(solrServerUrl, numDocsToAddInBatch, numOfThreads);
For each Item…
-->Create document
-->server.add(document)
When all finished,
server.commit();
server.optimize();
The problem:
Some of my items are not making it into the SOLR index, but no logs are being generated to tell me what happened.
I was able to find most of the documents, but some were missing. No errors in any logs – and I have substantial try/catch blocks with logging around all SOLRJ exceptions on the client side.
Verify logging is not being hidden for the SOLR WAR
You will definitely want to verify that the SOLR server log settings are not hiding the fact that documents are failing to be added to the index.
Because SOLR uses the SLF4J API, your SOLR server could be overriding the log settings that would otherwise let you see an error message when a document fails to be indexed.
If you have a custom {solr-war}/WEB-INF/classes/logging.properties, you will need to make sure that the settings are not hiding the error messages.
By default, errors in adding an item should be shown automatically. So if you did not change your SOLR log settings at any point... you should be seeing any errors during indexing in your server log file.
Troubleshoot why Documents are failing to be indexed
In order to investigate this, it is helpful to run the following verification step any time after the indexing is complete:
Initialize new log log_fromsolr
Initialize new log log_notfound
For each Item…
-->Search SOLR for the item. If SOLR has the object, log each item's fields on a single line in log_fromsolr. This should include the uniqueKey for your document if you have one.
-->If the document cannot be found in SOLR for this item, write a line to log_notfound with all the fields from the object from the database, putting the uniqueKey first.
Once the verification step has completed, the log_notfound log contains a list of all documents that failed to be added to the index.
You can use the log_fromsolr log to compare the document fields for an item that made it into the index and one that did not.
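A minimal sketch of that verification loop (assuming an "id" uniqueKey, items available as dicts, and a plain /select endpoint; adjust the URL and field names to your setup):

import requests

SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical URL; adjust to your core

def verify(items, fromsolr_path="log_fromsolr", notfound_path="log_notfound"):
    # `items` is assumed to be an iterable of dicts read from the database,
    # each containing the uniqueKey under "id".
    with open(fromsolr_path, "w") as fromsolr, open(notfound_path, "w") as notfound:
        for item in items:
            params = {"q": 'id:"%s"' % item["id"], "wt": "json", "rows": 0}
            found = requests.get(SOLR_SELECT, params=params).json()["response"]["numFound"]
            line = "\t".join("%s=%s" % (k, item[k]) for k in sorted(item))
            if found:
                fromsolr.write(line + "\n")
            else:
                notfound.write("%s\t%s\n" % (item["id"], line))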
Verify it is not an intermittent issue
Sometimes it is not the same items that fail to be added to the index each time you run the indexing.
If you find objects in the log_notfound log, you will want to back up the current notfound log and run the indexing process again from scratch. Use a diff tool to see the differences between the first notfound log and the second notfound log.
An intermittent problem is evident when you see large numbers of differences in these files (Note: some differences are to be expected if new objects are being created in the database in between the first and second re-indexing).
If your problem is intermittent, it most certainly points at the application code with respect to your SOLR transactions not being committed correctly.
The same documents consistently come up missing each time it indexes
At this point we have to compare documents that are being found in the SOLR index versus documents that are not getting into the Lucene index. Usually a field-by-field comparison of the object will start turning up some suspicious values that may be causing issues when adding the document to the index.
Try eliminating all the suspicious fields and then re-indexing the entire thing again. See if the documents are still failing to be indexed. If this worked, you will want to start re-introducing the fields that you removed and see if you can pinpoint the one that is the issue.