SolrCloud: Underlying file changed by external force?

I am having trouble with an issue in SolrCloud that I am unable to reproduce. It seems to happen at random.
Underlying file changed by an external force at 2018-09-18T14:55:22Z, (lock=NativeFSLock(path=/path/to/my/shard/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2018-09-18T14:55:22.006973Z)) Caused by: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2018-09-18T14:55:21Z,
The index is schemaless and typically receives many simultaneous updates.
It usually starts with errors like the following, in this order, before the write.lock error occurs:
Error from server at server3/solr/myshard: Bad Request
Remote error message: Exception writing document id bc04df6e-f29f-4091-ad73-f708a97d28b4 to the index; possible analysis error.
3 Async exceptions during distributed update:
Remote error message: this IndexWriter is closed
Is there any way to recover automatically from the IndexWriter being closed? After the error occurs, no more documents can be written to the collection. The only solutions I have right now are to delete the write.lock and restart Solr, or to recreate the collection completely.

Related

EF Core executing previously failed query on SaveChanges

We have a .NET Core 3.1 application that uses EF Core to connect to a SQL Server database. We are facing a problem: when an exception occurs in our method, for example because a mandatory field was passed as null and SaveChanges raised an exception, the next call to the same method with different parameters (even with all mandatory fields correct) still executes the old query in EF Core (I checked this in the Output window). This is very strange behavior.
If we close the application (in the debug environment) after the first exception and rerun it with a correct payload, everything works fine. Is EF Core retrying the earlier failed query for some reason? Why does this behavior occur?
Depending on how you are getting messages from the queue, a message may come back to the queue if it is not acknowledged after a while.
With auto acknowledge, the message is removed permanently. If you ack after persisting to the database, any exception will prevent the ack, so the message goes back to the queue.
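The difference between the two acknowledgement modes can be sketched with a toy in-memory queue. This is not RabbitMQ or its client API; `TinyQueue`, `consume`, and `failing_persist` are hypothetical names used only to illustrate the behavior described above.

```python
from collections import deque


class TinyQueue:
    """Toy in-memory broker (hypothetical); just enough to show ack semantics."""

    def __init__(self, messages):
        self.ready = deque(messages)

    def consume(self, handler, auto_ack):
        """Deliver one message; return True if it left the queue for good."""
        message = self.ready.popleft()
        if auto_ack:
            # Auto acknowledge: the message is gone before the handler runs.
            try:
                handler(message)
            except Exception:
                pass  # the failure cannot bring the message back
            return True
        try:
            handler(message)
        except Exception:
            # No ack was sent, so the message becomes available again.
            self.ready.append(message)
            return False
        return True  # ack only after a successful persist


def failing_persist(message):
    # Stands in for a database write that rejects the row.
    raise ValueError("mandatory field is null")


auto = TinyQueue(["order-1"])
auto.consume(failing_persist, auto_ack=True)

manual = TinyQueue(["order-1"])
manual.consume(failing_persist, auto_ack=False)

print(len(auto.ready), len(manual.ready))
```

With manual acknowledgement the failed message is still in the queue and will be delivered again, which can look like an old operation being retried.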
I just figured out that the RabbitMQ consumer classes were not properly set up in the Dependency Configuration file. Basically, the repository classes were instantiated as static members in the Dependency Config, and when the consumers were set up for RabbitMQ, they were passed these static repository objects. I removed the static objects and used scoped instances of the repository instead. This resolved my issue.
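Why a long-lived (static) repository can replay a failed operation can be illustrated with a toy unit of work. This is a sketch in Python, not EF Core: `FakeContext` and `handle` are hypothetical, and the assumption is that a failed save leaves the rejected row in the pending set, as a change tracker would.

```python
class FakeContext:
    """Toy unit of work (hypothetical): keeps pending rows until a save succeeds."""

    def __init__(self):
        self.pending = []
        self.saved = []

    def add(self, row):
        self.pending.append(row)

    def save_changes(self):
        # Validate every pending row, including rows left over from earlier calls.
        for row in self.pending:
            if row.get("name") is None:
                raise ValueError("name is mandatory")
        self.saved.extend(self.pending)
        self.pending.clear()


def handle(context, row):
    """Process one incoming message with the given context."""
    context.add(row)
    try:
        context.save_changes()
        return "ok"
    except ValueError:
        return "failed"


# One shared (static-like) context: the bad row stays pending and poisons later saves.
shared = FakeContext()
shared_results = [handle(shared, {"name": None}), handle(shared, {"name": "valid"})]

# A fresh (scoped-like) context per message: the second save is unaffected.
scoped_results = [handle(FakeContext(), {"name": None}),
                  handle(FakeContext(), {"name": "valid"})]

print(shared_results, scoped_results)
```

In the shared case the second, valid message also fails because the earlier rejected row is still pending; with a fresh context per message it succeeds.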

Hibernate4, Grails 2.5 -- cached data persists between restarts?

I'm running into a strange issue with Hibernate4 caching in a Grails 2.5.0 application that is serving as a platform for data migrated from a legacy system. The migration involves direct database inserts and removals (while testing migration SQL) of database records. These operations are causing pageload errors in the system because cached data is different from the actual state of the database. Stacktrace errors on a particular failed page load indicate missing records whose IDs are not currently referenced by anything in the database via foreign key. For example, one page fails to render with the following error:
018-02-27 10:16:32,495 http-bio-8080-exec-8 | ERROR StackTrace | superAdmin | Full Stack Trace:
org.hibernate.UnresolvableObjectException: No row with the given identifier exists: [com.tlc.worx.company.CompanyQuestion#48466]
at org.hibernate.UnresolvableObjectException.throwIfNull(UnresolvableObjectException.java:68)
at org.hibernate.event.internal.DefaultRefreshEventListener.onRefresh(DefaultRefreshEventListener.java:179)
at org.hibernate.event.internal.DefaultRefreshEventListener.onRefresh(DefaultRefreshEventListener.java:61)
at org.hibernate.internal.SessionImpl.fireRefresh(SessionImpl.java:1121)
at org.hibernate.internal.SessionImpl.refresh(SessionImpl.java:1094)
at org.hibernate.internal.SessionImpl.refresh(SessionImpl.java:1089)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate$10.doInHibernate(GrailsHibernateTemplate.java:342)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.doExecute(GrailsHibernateTemplate.java:188)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.refresh(GrailsHibernateTemplate.java:339)
at org.codehaus.groovy.grails.orm.hibernate.GrailsHibernateTemplate.refresh(GrailsHibernateTemplate.java:335)
at org.codehaus.groovy.grails.orm.hibernate.HibernateGormInstanceApi.refresh(HibernateGormInstanceApi.groovy:150)
at com.tlc.worx.company.CompanyQuestion.refresh(CompanyQuestion.groovy)
at com.tlc.worx.company.CompanyQuestion$refresh.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:110)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:114)
at com.tlc.worx.checklist.CompanyQuestionController$_index_closure1$_closure2$_closure3.doCall(CompanyQuestionController.groovy:49)
at com.tlc.worx.checklist.CompanyQuestionController$_index_closure1$_closure2$_closure3.doCall(CompanyQuestionController.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
--
A search of the database for the record number being referenced reveals it only in a completely unrelated field and a Tally table:
(Note that the page hitting the error has nothing to do with the missing record itself, only with an object that could be related to a CompanyQuestion but is currently NOT related to it.)
I suspected a Hibernate caching issue, especially since this has coincided with the removal of records. Furthermore, migrating the same database to another environment for testing does not give rise to the same error on the new environment--corroborating my theory that this is related to environment-specific caching. But oddly, a Tomcat7 restart (app runs on Tomcat) within the original environment does not cause the problem to go away. Hibernate configuration is as follows:
hibernate {
    cache.use_second_level_cache = true
    cache.use_query_cache = false
    cache.region.factory_class = 'org.hibernate.cache.ehcache.SingletonEhCacheRegionFactory' // Hibernate 4
    singleSession = true // configure OSIV singleSession mode
    flush.mode = 'auto' // pre-Hibernate4 default behavior was auto, so we'll stick with that for now. See https://grails.org/2.4.3+Release+Notes
}
It's the restart not causing the issue to disappear that has me scratching my head--is this normal Hibernate behavior to cache things for eternity even between Tomcat restarts? Am I missing the mark entirely here? My next step is running the application in the first environment with the second level cache disabled, but I would like to get community feedback as well that I am at least on the right track regarding my theory--it seems crazy. Any recommendations/feedback appreciated!
Wanted to close this up as I eventually found the issue.
Our application utilises ElasticSearch for compiling a set of data to query for searches. We have always reindexed ElasticSearch on app restart, and not run into this issue before, however I learned that the reindex operation does not always do exactly what we want to do and can actually index old data as new records, leading to either duplicates or a mixed bag of good/bad records.
The error in my question occurred when hitting one of the "bad", stale records from a previous reindex. Purging all ElasticSearch indices prior to reindexing resolved the issue.
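The effect of purging before a reindex can be shown with a plain dictionary standing in for the search index. This is only a sketch: `reindex` is a hypothetical helper, not an ElasticSearch call, and the sample rows are made up for illustration.

```python
def reindex(index, database_rows, purge_first):
    """Copy the current database rows into the index, optionally clearing it first."""
    if purge_first:
        index.clear()
    for row_id, row in database_rows.items():
        index[row_id] = row
    return index


# An index left over from an earlier run, including a record since deleted.
stale_index = {48466: "deleted question", 1: "old text"}
database = {1: "current text"}

without_purge = reindex(dict(stale_index), database, purge_first=False)
with_purge = reindex(dict(stale_index), database, purge_first=True)

print(sorted(without_purge), sorted(with_purge))
```

Without the purge, the entry for the deleted record survives the reindex, so a search hit can still point at a row that no longer exists in the database.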

Solr 6.4: Cannot unload core via API or Admin Panel

The problem is this: I tried to replace a core by creating a new one with a different name, swapping them, and then UNLOADing the old one, but it failed.
Now, even when I try to clean everything up manually (unloading the cores with the AdminPanel or via curl using deleteIndexDir=true&deleteInstanceDir=true, and deleting the physical directories of both cores), nothing works.
If I UNLOAD the cores using the AdminPanel, then I don't see the cores listed anymore. But the STATUS command still returns me this:
$ curl -XGET 'http://localhost:8983/solr/admin/cores?action=STATUS&core=mycore&wt=json'
{"responseHeader":{"status":0,"QTime":0},"initFailures":{},"status":{"mycore":{"name":"mycore","instanceDir":"/var/solr/data/mycore","dataDir":"data/","config":"solrconfig.xml","schema":"schema.xml","isLoaded":"false"}}}
But, if I try to UNLOAD the core via curl:
$ curl -XGET 'http://localhost:8983/solr/admin/cores?action=UNLOAD&deleteIndexDir=true&deleteInstanceDir=true&core=mycore&wt=json'
{"responseHeader":{"status":0,"QTime":0}}
and there is no effect. I still see the core listed in the AdminPanel, STATUS returns exactly the same response, and of course if I try to access the cores, errors start popping up telling me that solrconfig.xml doesn't exist. Of course, nothing exists.
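For scripting this check, a small sketch can build the same CoreAdmin URLs and inspect the STATUS body shown above. The helper names are hypothetical, and treating a non-empty `status` entry as "still registered" is an assumption about how a fully removed core would be reported.

```python
import json
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/admin/cores"


def core_admin_url(action, core, **extra):
    """Build a CoreAdmin URL such as the STATUS and UNLOAD calls above."""
    params = {"action": action, "core": core, "wt": "json", **extra}
    return BASE + "?" + urlencode(params)


def status_entry(status_body, core):
    """Return the per-core entry from a STATUS response body (empty dict if absent)."""
    return json.loads(status_body).get("status", {}).get(core, {})


def is_still_registered(status_body, core):
    # Assumption: a core that is completely gone has no name in its entry.
    return bool(status_entry(status_body, core).get("name"))


def is_loaded(status_body, core):
    return str(status_entry(status_body, core).get("isLoaded", "false")).lower() == "true"


# The STATUS response quoted in the question.
sample = ('{"responseHeader":{"status":0,"QTime":0},"initFailures":{},'
          '"status":{"mycore":{"name":"mycore","instanceDir":"/var/solr/data/mycore",'
          '"dataDir":"data/","config":"solrconfig.xml","schema":"schema.xml",'
          '"isLoaded":"false"}}}')

print(core_admin_url("STATUS", "mycore"))
print(is_still_registered(sample, "mycore"), is_loaded(sample, "mycore"))
```

For the response above, the core is reported as registered but not loaded, which matches the stuck state described in the question.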
I know that if I restart Solr everything will be fine. But I cannot restart Solr in production every time it gets into this dirty state on its own (and it does, very often).
Some time ago I made a comment here but I didn't get any useful reply.
Now, the real problem is that there are other cores working in production, and restarting Solr takes about half an hour, which is not OK at all.
So, the question is how to clean up unloaded cores properly WITHOUT restarting Solr. Before saying "no, it's not possible", please try to understand the business requirement. It MUST be possible somehow. If you know the reason why it's not possible, let's start thinking together about how it could be made possible.
UPDATE
I'm adding here some errors I've found looking at the logs, I hope it helps:
Solr init error
Solr create error
Solr duplicate requestid error (my script tried twice using the same id)
Solr closing index writer error
Solr error opening new searcher
I've just noticed that the error opening searcher and the one creating the core are related, both have Caused by: java.nio.file.FileAlreadyExistsException: /var/solr/data/mycore/data/index/write.lock

Realtime API Fatal Network Error on Load

I've been seeing instances of 'fatal_network_error' popping up in production more frequently recently.
This particular error unfortunately isn't documented anywhere as to what the issue may be, so apart from retrying the doc in the client open I'm unsure of how to prevent something like this from consistently happening to people. In particular, one document [ID: 0B1es-bMybSeSb3hnSHdRUTNLSGc] that I have access to appears to be hitting this nearly 100% of the times I have opened it in Firefox (though it appears this is not exclusive to FF from other error reports).
After re-requesting the same document to be opened a second time (after handling the fatal error), the document load is successful. Looking at the network requests identical requests are made with the exact same params (rctype,recver,id,access_token). Additionally the response from the server is identical apart from what looks like a version hash and possibly a version number after the Document ID.
Is this a known issue? Are there workarounds?
I have shared the document publicly, please let me know if there is any other information that I can provide that may help track this down.

Service Broker error handling simulation

I am working on a project in which multiple POSes should be synchronized to a main server using the Service Broker feature. I am now preparing error handling for this solution and want to show the client how it works. That means I will prepare test scripts for every kind of error, and the client will run them on a test POS to see whether the errors are processed correctly.
We will use SQL Server 2008R2 with poison message = OFF.
Message type=XML (but inside can be different type of data, some nodes will contain BLOBs).
The POSes will be outside the domain, so transport will be secured (but with no dialog encryption).
I divide the errors into several sub-groups:
1. Logical error (e.g. a string instead of a number). It will be processed by a TRY-CATCH block on the server side. It is easy to simulate.
2. Service Broker configuration error (the message either will not be returned or cannot reach its destination). I think it can be handled by using SQL Server Service Broker events, and the simulation will be some kind of "bad configuration" (SB GUID, service name, etc.).
3. Transport error. This is when we have a broken message. In fact, it is the client's wish to test this kind of error. I do not know whether a secured transport level (certificate) protects us from this kind of error. Another question is how I can simulate this.
Questions:
Are there other error types?
Is the error handling logic for #2 described well enough?
How should #3 be handled and simulated?
The second part of my article here goes into a discussion of Service Broker errors, how they occur and how to handle them. The important thing for you is to distinguish between two categories of errors:
Recoverable: transport problems, most configuration errors such as bad routing, or an unreachable server. None of these results in an SSB error, only in a delay. Messages stay in transmission_queue on the expectation that the problem is transient and can be solved (this includes some configuration problems). Once the problem is solved, SSB retries and the message gets delivered.
Unrecoverable: problems that SSB deems non-recoverable, e.g. a bad message format. In such a case the conversation is aborted and both endpoints receive an Error message.
I also have an article Error Handling in Service Broker procedures that discusses some of the topics particular to exception handling in SSB activated context.
A final note: I strongly discourage you from turning poison message detection OFF. It is much better to disable the processing than to spin ad nauseam without making progress because of a poison message.
On the topic of how to simulate a corrupted message: it is hard to simulate (you can try setting up a port forwarder that lets all traffic pass through but randomly corrupts some of it), and it is rather pointless. All SSB traffic, even when in clear text, is cryptographically signed, and any message corruption would result in an abrupt disconnect due to a message signing validation failure.
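The reason a corrupted-in-transit message cannot slip through a signed channel can be sketched with a generic HMAC check. This is an illustration of message signing in general, not Service Broker's actual wire protocol; the key, the payload, and the helper names are hypothetical.

```python
import hashlib
import hmac


def sign(key, payload):
    """Compute a signature over the payload (generic HMAC-SHA256, for illustration)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key, payload, signature):
    """Return True only if the payload still matches its signature."""
    return hmac.compare_digest(sign(key, payload), signature)


def flip_one_bit(payload, index):
    """Simulate a corrupting port forwarder by flipping a single bit."""
    corrupted = bytearray(payload)
    corrupted[index] ^= 0x01
    return bytes(corrupted)


key = b"hypothetical-shared-key"
message = b"<pos><sale id='1'/></pos>"
signature = sign(key, message)

intact_ok = verify(key, message, signature)
corrupted_ok = verify(key, flip_one_bit(message, 5), signature)

print(intact_ok, corrupted_ok)
```

Even a single flipped bit makes verification fail, so the receiver rejects the data before any application-level error handling sees it.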
